aethos package

Data Module

class aethos.analysis.Analysis(x_train, x_test=None, target='')

Bases: aethos.visualizations.visualizations.Visualizations, aethos.stats.stats.Stats

Core class that runs analytical techniques.

Parameters:
  • x_train (pd.DataFrame) – Training data or aethos data object
  • x_test (pd.DataFrame) – Test data, by default None
  • target (str) – For supervised learning problems, the name of the column you’re trying to predict.
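
As a minimal sketch (the CSV path and the ‘Survived’ target column are illustrative placeholders, not part of the API), the data object used in the examples on this page can be created as follows:

>>> import pandas as pd
>>> from aethos.analysis import Analysis
>>> df = pd.read_csv('train.csv')            # any pandas DataFrame
>>> data = Analysis(df, target='Survived')   # 'Survived' is an illustrative column name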
autoviz(max_rows=150000, max_cols=30, verbose=0)

Auto visualizes and analyzes your data to help explore your data.

Credits go to AutoViML - https://github.com/AutoViML/AutoViz

Parameters:
  • max_rows (int, optional) – Max rows to analyze, by default 150000
  • max_cols (int, optional) – Max columns to analyze, by default 30
  • verbose ({0, 1, 2}, optional) –

    0 - does not print any messages (silent mode)
    1 - prints messages on the terminal and also displays charts on the terminal
    2 - prints messages but does not display charts; it simply saves them
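
Usage sketch (the parameter values shown are illustrative):

>>> data.autoviz()
>>> data.autoviz(max_rows=50000, verbose=1)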
checklist()

Displays a checklist dashboard with reminders for a Data Science project.

Examples

>>> data.checklist()
column_info(dataset='train')

Describes your columns using the DataFrameSummary library with basic descriptive info.

Credits go to @mouradmourafiq for his pandas-summary library.

Statistics reported: counts, uniques, missing, missing_perc, types

Parameters:dataset (str, optional) – Type of dataset to describe. Can either be train or test. If you are using the full dataset it will automatically describe your full dataset no matter the input, by default ‘train’
Returns:Dataframe describing your columns with basic descriptive info
Return type:DataFrame

Examples

>>> data.column_info()
columns

Property to return columns in the dataset.
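
For example:

>>> data.columns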

copy()

Returns deep copy of object.

Returns:Deep copy of object
Return type:Object
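
For example (the copy is deep, so transformations applied to the copy do not affect the original):

>>> data_copy = data.copy()
>>> data_copy is data
False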
correlation_matrix(data_labels=False, hide_mirror=False, output_file='', **kwargs)

Plots a correlation matrix of all the numerical variables.

For more information on possible kwargs please see: https://seaborn.pydata.org/generated/seaborn.heatmap.html

Parameters:
  • data_labels (bool, optional) – True to display the correlation values in the plot, by default False
  • hide_mirror (bool, optional) – Whether to display the mirroring half of the correlation plot, by default False
  • output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)

Examples

>>> data.correlation_matrix(data_labels=True)
>>> data.correlation_matrix(data_labels=True, output_file='corr.png')
data_report(title='Profile Report', output_file='', suppress=False)

Generates a full Exploratory Data Analysis report using Pandas Profiling.

Credits: https://github.com/pandas-profiling/pandas-profiling

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
  • Missing values matrix, count, heatmap and dendrogram of missing values
Parameters:
  • title (str, optional) – Title of the report, by default ‘Profile Report’
  • output_file (str, optional) – File name of the output file for the report, by default ‘’
  • suppress (bool, optional) – True if you do not want to display the report, by default False
Returns:HTML display of Exploratory Data Analysis report

Examples

>>> data.data_report()
>>> data.data_report(title='Titanic EDA', output_file='titanic.html')
describe(dataset='train')

Describes your dataset using the DataFrameSummary library with basic descriptive info. Extends the DataFrame.describe() method to give more info.

Credits go to @mouradmourafiq for his pandas-summary library.

Parameters:dataset (str, optional) – Type of dataset to describe. Can either be train or test. If you are using the full dataset it will automatically describe your full dataset no matter the input, by default ‘train’
Returns:Dataframe describing your dataset with basic descriptive info
Return type:DataFrame

Examples

>>> data.describe()
describe_column(column, dataset='train')

Analyzes a column and reports descriptive statistics about the columns.

Credits go to @mouradmourafiq for his pandas-summary library.

Statistics reported: std, max, min, variance, mean, mode, 5%, 25%, 50%, 75%, 95%, iqr, kurtosis, skewness, sum, mad, cv, zeros_num, zeros_perc, deviating_of_mean, deviating_of_mean_perc, deviating_of_median, deviating_of_median_perc, top_correlations, counts, uniques, missing, missing_perc, types

Parameters:
  • column (str) – Column in your dataset you want to analyze.
  • dataset (str, optional) – Type of dataset to describe. Can either be train or test. If you are using the full dataset it will automatically describe your full dataset no matter the input, by default ‘train’
Returns:

Dictionary mapping a statistic and its value for a specific column

Return type:

dict

Examples

>>> data.describe_column('col1')
drop(*drop_columns, keep=[], regexp='', reason='')

Drops columns from the dataframe.

Parameters:
  • keep (list: optional) – List of columns to not drop, by default []
  • regexp (str, optional) – Regular Expression of columns to drop, by default ‘’
  • reason (str, optional) – Reasoning for dropping columns, by default ‘’
  • Column names must be provided as strings that exist in the data.
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.drop('A', 'B', reason="Columns were unimportant")
>>> data.drop('col1', keep=['col2'], regexp=r"col*") # Drop all columns that start with "col" except column 2
>>> data.drop(keep=['A']) # Drop all columns except column 'A'
>>> data.drop(regexp=r'col*') # Drop all columns that start with 'col'
encode_target()

Encodes target variables with value between 0 and n_classes-1.

Running this function will automatically store the mapping from each encoded number back to the original target value.

Note that this will not work if your test data has labels that your train data does not.

Returns:Returns a deep copy of the Data object.
Return type:Data

Examples

>>> data.encode_target()
expand_json_column(col)

Utility function that expands a column that has JSON elements into columns, where each JSON key is a column.

Parameters:col (str) – Column in the data that has the nested data.
Returns:Returns a deep copy of the Data object.
Return type:Data

Examples

>>> data.expand_json_column('col1')
features

Property to return the features used for modelling.

interpret_data(show=True)

Interpret your data using MSFT Interpret dashboard.
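
Usage sketch (assumes the interpret package is installed):

>>> data.interpret_data()
>>> data.interpret_data(show=False)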

missing_values

Property function that shows how many values are missing in each column.
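
For example:

>>> data.missing_values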

predictive_power(col=None, data_labels=False, hide_mirror=False, output_file='', **kwargs)

Calculates the Predictive Power Score of each feature.

If a column is provided, the score is calculated with respect to the target variable.

Credits go to Florian Wetschorek - https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598

Parameters:
  • col (str) – Column in the dataframe
  • data_labels (bool, optional) – True to display the correlation values in the plot, by default False
  • hide_mirror (bool, optional) – Whether to display the mirroring half of the correlation plot, by default False
  • output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)

Examples

>>> data.predictive_power(data_labels=True)
>>> data.predictive_power(col='col1')
standardize_column_names()

Utility function that standardizes all column names by converting them to lowercase and replacing spaces with underscores.

Returns:Returns a deep copy of the Data object.
Return type:Data

Examples

>>> data.standardize_column_names()
to_csv(name: str, index=False, **kwargs)

Write data to csv with the name and path provided.

The function will automatically add ‘.csv’ to the end of the name.

By default it writes 10000 rows at a time to keep memory usage reasonable on different machines.

Training data will end in ‘_train.csv’ and test data will end in ‘_test.csv’.

For a full list of keyword args for writing to csv please see the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

Parameters:
  • name (str) – File path
  • index (bool, optional) – True to write ‘index’ column, by default False

Examples

>>> data.to_csv('titanic')
to_df()

Return DataFrames for x_train, and x_test if it exists.

Returns:The x_train DataFrame, and the x_test DataFrame if x_test was provided.

Return type:DataFrame, *DataFrame

Examples

>>> data.to_df()
y_test

Property function for the test target variable

y_train

Property function for the training target variable
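
Both are accessed as properties; for example, assuming a target column was set when the object was constructed:

>>> data.y_train
>>> data.y_test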

class aethos.visualizations.visualizations.Visualizations

Bases: object

barplot(x: str, y=None, method=None, asc=None, orient='v', title='', output_file='', **barplot_kwargs)

Plots a bar plot for the given columns provided using Plotly.

If a groupby is performed, method must be provided. For example, to plot Age against survival rate, you would group by Age and use the mean as the method.

For a list of group by methods please checkout the following pandas link: https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats

For a list of possible arguments for the bar plot please checkout the following links: https://plot.ly/python-api-reference/generated/plotly.express.bar.html

Parameters:
  • x (str) – Column name for the x axis.
  • y (str, optional) – Column(s) you would like to see plotted against the x_col
  • method (str, optional) – Method to aggregate groupby data (e.g. min, max, mean), by default None
  • asc (bool) – True to sort values in ascending order, False for descending
  • orient (str (default 'v')) – One of ‘h’ for horizontal or ‘v’ for vertical
  • title (str) – The figure title.
  • color (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign color to marks.
  • hover_name (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in bold in the hover tooltip.
  • hover_data (list of str or int, or Series or array-like) – Either names of columns in data_frame, or pandas Series, or array_like objects Values from these columns appear as extra data in the hover tooltip.
  • custom_data (list of str or int, or Series or array-like) – Either names of columns in data_frame, or pandas Series, or array_like objects Values from these columns are extra data, to be used in widgets or Dash callbacks for example. This data is not user-visible but is included in events emitted by the figure (lasso selection etc.)
  • text (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in the figure as text labels.
  • animation_frame (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to animation frames.
  • animation_group (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to provide object-constancy across animation frames: rows with matching `animation_group`s will be treated as if they describe the same object in each frame.
  • labels (dict with str keys and str values (default {})) – By default, column names are used in the figure for axis titles, legend entries and hovers. This parameter allows this to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired label to be displayed.
  • color_discrete_sequence (list of str) – Strings should define valid CSS-colors. When color is set and the values in the corresponding column are not numeric, values in that column are assigned colors by cycling through color_discrete_sequence in the order described in category_orders, unless the value of color is a key in color_discrete_map. Various useful color sequences are available in the plotly.express.colors submodules, specifically plotly.express.colors.qualitative.
  • color_discrete_map (dict with str keys and str values (default {})) – String values should define valid CSS-colors Used to override color_discrete_sequence to assign a specific colors to marks corresponding with specific values. Keys in color_discrete_map should be values in the column denoted by color.
  • color_continuous_scale (list of str) – Strings should define valid CSS-colors. This list is used to build a continuous color scale when the column denoted by color contains numeric data. Various useful color scales are available in the plotly.express.colors submodules, specifically plotly.express.colors.sequential, plotly.express.colors.diverging and plotly.express.colors.cyclical.
  • opacity (float) – Value between 0 and 1. Sets the opacity for markers.
  • barmode (str (default 'relative')) – One of ‘group’, ‘overlay’ or ‘relative’ In ‘relative’ mode, bars are stacked above zero for positive values and below zero for negative values. In ‘overlay’ mode, bars are drawn on top of one another. In ‘group’ mode, bars are placed beside each other.
  • width (int (default None)) – The figure width in pixels.
  • height (int (default 600)) – The figure height in pixels.
  • output_file (str, optional) – Output html file name for image
Returns:

Plotly Figure Object of Bar Plot

Return type:

Plotly Figure

Examples

>>> data.barplot(x='x', y='y')
>>> data.barplot(x='x', method='mean')
>>> data.barplot(x='x', y='y', method='max', orient='h')
boxplot(x=None, y=None, color=None, title='', output_file='', **kwargs)

Plots a box plot for the given x and y columns.

For more info and kwargs for box plots, see https://plot.ly/python-api-reference/generated/plotly.express.box.html#plotly.express.box and https://plot.ly/python/box-plots/

Parameters:
  • x (str) – X axis column
  • y (str) – y axis column
  • color (str, optional) – Column name to add a dimension by color.
  • orient (str, optional) – Orientation of graph, ‘h’ for horizontal ‘v’ for vertical, by default ‘v’,
  • points (str or bool {'outliers', 'suspectedoutliers', 'all', False}) – If ‘outliers’, only the sample points lying outside the whiskers are shown. If ‘suspectedoutliers’, all outlier points are shown and those less than 4*Q1-3*Q3 or greater than 4*Q3-3*Q1 are highlighted with the marker’s ‘outliercolor’. If ‘all’, all sample points are shown. If False, no sample points are shown and the whiskers extend to the full range of the sample.
  • notched (bool, optional) – If True, boxes are drawn with notches, by default False.
  • title (str, optional) – Title of the plot, by default “”.
  • output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
Returns:

Plotly Figure Object of Box Plot

Return type:

Plotly Figure

Examples

>>> data.boxplot(y='y', color='z')
>>> data.boxplot(x='x', y='y', color='z', points='all')
>>> data.boxplot(x='x', y='y', output_file='pair.png')
histogram(*x, hue=None, plot_test=False, output_file='', **kwargs)

Plots a histogram of the given column(s).

If no columns are provided, histograms are plotted for all numeric columns

For more histogram key word arguments, please see https://seaborn.pydata.org/generated/seaborn.distplot.html

Parameters:
  • x (str or str(s)) – Column(s) to plot histograms for.
  • hue (str, optional) – Column to colour points by, by default None
  • plot_test (bool, optional) – True to plot distribution of the test data for the same variable
  • bins (argument for matplotlib hist(), or None, optional) – Specification of hist bins, or None to use Freedman-Diaconis rule.
  • hist (bool, optional) – Whether to plot a (normed) histogram.
  • kde (bool, optional) – Whether to plot a gaussian kernel density estimate.
  • rug (bool, optional) – Whether to draw a rugplot on the support axis.
  • fit (random variable object, optional) – An object with a fit method, returning a tuple that can be passed to a pdf method as positional arguments following a grid of values to evaluate the pdf on.
  • output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)

Examples

>>> data.histogram()
>>> data.histogram('col1')
>>> data.histogram('col1', 'col2', hue='col3', plot_test=True)
>>> data.histogram('col1', kde=False)
>>> data.histogram('col1', 'col2', hist=False)
>>> data.histogram('col1', kde=False, fit=stat.normal)
>>> data.histogram('col1', kde=False, output_file='hist.png')
jointplot(x: str, y: str, kind='scatter', output_file='', **kwargs)

Plots joint plots of 2 different variables.

Scatter (‘scatter’): Scatter plot and histograms of x and y.

Regression (‘reg’): Scatter plot, with regression line and histograms with kernel density fits.

Residuals (‘resid’): Scatter plot of residuals and histograms of residuals.

Kernel Density Estimates (‘kde’): Density estimate plot and histograms.

Hex (‘hex’): Replaces scatterplot with joint histogram using hexagonal bins and histograms on the axes.

For more info and kwargs for joint plots, see https://seaborn.pydata.org/generated/seaborn.jointplot.html

Parameters:
  • x (str) – X axis column
  • y (str) – y axis column
  • kind ({ “scatter” | “reg” | “resid” | “kde” | “hex” }, optional) – Kind of plot to draw, by default ‘scatter’
  • color (matplotlib color, optional) – Color used for the plot elements.
  • dropna (bool, optional) – If True, remove observations that are missing from x and y.
  • {x, y}lim (two-tuples, optional) – Axis limits to set before plotting.
  • {joint, marginal, annot}_kws (dicts, optional) – Additional keyword arguments for the plot components.
  • output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)

Examples

>>> data.jointplot(x='x', y='y', kind='kde', color='crimson')
>>> data.jointplot(x='x', y='y', kind='kde', color='crimson', output_file='pair.png')
lineplot(x: str, y: str, z=None, color=None, title='Line Plot', output_file='', **lineplot_kwargs)

Plots a lineplot for the given x and y columns provided using Plotly Express.

For a list of possible lineplot_kwargs, please see the Plotly Express line (2d) and line_3d (3d) documentation.

Parameters:
  • x (str) – X column name
  • y (str) – Column name to plot on the y axis.
  • z (str) – Column name to plot on the z axis.
  • title (str, optional) – Title of the plot, by default ‘Line Plot’
  • color (str) – Category column to draw multiple line plots of
  • output_file (str, optional) – Output html file name for image
  • text (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in the figure as text labels.
  • facet_row (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to facetted subplots in the vertical direction.
  • facet_col (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to facetted subplots in the horizontal direction.
  • error_x (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars. If error_x_minus is None, error bars will be symmetrical, otherwise error_x is used for the positive direction only.
  • error_x_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars in the negative direction. Ignored if error_x is None.
  • error_y (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars. If error_y_minus is None, error bars will be symmetrical, otherwise error_y is used for the positive direction only.
  • error_y_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars in the negative direction. Ignored if error_y is None.
  • animation_frame (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to animation frames.
  • animation_group (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to provide object-constancy across animation frames: rows with matching `animation_group`s will be treated as if they describe the same object in each frame.
  • labels (dict with str keys and str values (default {})) – By default, column names are used in the figure for axis titles, legend entries and hovers. This parameter allows this to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired label to be displayed.
  • color_discrete_sequence (list of str) – Strings should define valid CSS-colors. When color is set and the values in the corresponding column are not numeric, values in that column are assigned colors by cycling through color_discrete_sequence in the order described in category_orders, unless the value of color is a key in color_discrete_map. Various useful color sequences are available in the plotly.express.colors submodules, specifically plotly.express.colors.qualitative.
  • color_discrete_map (dict with str keys and str values (default {})) – String values should define valid CSS-colors Used to override color_discrete_sequence to assign a specific colors to marks corresponding with specific values. Keys in color_discrete_map should be values in the column denoted by color.
Returns:

Plotly Figure Object of Line Plot

Return type:

Plotly Figure

Examples

>>> data.lineplot(x='x', y='y')
>>> data.lineplot(x='x', y='y', output_file='line')
pairplot(cols=[], kind='scatter', diag_kind='auto', upper_kind=None, lower_kind=None, hue=None, output_file='', **kwargs)

Plots pairplots of the variables from the training data.

If hue is not provided and a target variable is set, the data will be separated and highlighted by the classes in that column.

For more info and kwargs on pair plots, please see: https://seaborn.pydata.org/generated/seaborn.pairplot.html

Parameters:
  • cols (list) – Columns to view pairplot of.
  • kind (str {'scatter', 'reg'}, optional) – Type of plot for off-diag plots, by default ‘scatter’
  • diag_kind (str {'auto', 'hist', 'kde'}, optional) – Type of plot for diagonal, by default ‘auto’
  • upper_kind (str {'scatter', 'kde'}, optional) – Type of plot for upper triangle of pair plot, by default None
  • lower_kind (str {'scatter', 'kde'}, optional) – Type of plot for lower triangle of pair plot, by default None
  • hue (str, optional) – Column to colour points by, by default None
  • {x, y}_vars (lists of column names, optional) – Variables within data to use separately for the rows and columns of the figure; i.e. to make a non-square plot.
  • palette (dict or seaborn color palette) – Set of colors for mapping the hue variable. If a dict, keys should be values in the hue variable.
  • output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)

Examples

>>> data.pairplot(kind='kde')
>>> data.pairplot(kind='kde', output_file='pair.png')
pieplot(values: str, names: str, title='', textposition='inside', textinfo='percent', output_file='', **pieplot_kwargs)

Plots a Pie plot of a given column.

For more information regarding pie plots please see the following links: https://plot.ly/python/pie-charts/#customizing-a-pie-chart-created-with-pxpie and https://plot.ly/python-api-reference/generated/plotly.express.pie.html#plotly.express.pie.

Parameters:
  • values (str) – Column in the DataFrame. Values from this column or array_like are used to set values associated to sectors.
  • names (str) – Column in the DataFrame. Values from this column or array_like are used as labels for sectors.
  • title (str, optional) – The figure title, by default ‘’
  • textposition ({'inside', 'outside'}, optional) – Position the text in the plot, by default ‘inside’
  • textinfo (str, optional) –
    ‘textinfo’ can take any of the following values, joined with a ‘+’:
    ‘label’ - displays the label on the segment
    ‘text’ - displays the text on the segment (this can be set separately to the label)
    ‘value’ - displays the value passed into the trace
    ‘percent’ - displays the computed percentage
    By default ‘percent’.
  • output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
  • color (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign color to marks.
  • color_discrete_sequence (list of str) – Strings should define valid CSS-colors. When color is set and the values in the corresponding column are not numeric, values in that column are assigned colors by cycling through color_discrete_sequence in the order described in category_orders, unless the value of color is a key in color_discrete_map. Various useful color sequences are available in the plotly.express.colors submodules, specifically plotly.express.colors.qualitative.
  • color_discrete_map (dict with str keys and str values (default {})) – String values should define valid CSS-colors Used to override color_discrete_sequence to assign a specific colors to marks corresponding with specific values. Keys in color_discrete_map should be values in the column denoted by color.
  • hover_name (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in bold in the hover tooltip.
  • hover_data (list of str or int, or Series or array-like) – Either names of columns in data_frame, or pandas Series, or array_like objects. Values from these columns appear as extra data in the hover tooltip.
  • custom_data (list of str or int, or Series or array-like) – Either names of columns in data_frame, or pandas Series, or array_like objects Values from these columns are extra data, to be used in widgets or Dash callbacks for example. This data is not user-visible but is included in events emitted by the figure (lasso selection etc.)
  • labels (dict with str keys and str values (default {})) – By default, column names are used in the figure for axis titles, legend entries and hovers. This parameter allows this to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired label to be displayed.
  • width (int (default None)) – The figure width in pixels.
  • height (int (default 600)) – The figure height in pixels.
  • opacity (float) – Value between 0 and 1. Sets the opacity for markers.
  • hole (float) – Value between 0 and 1. Sets the size of the hole in the middle of the pie chart.
Returns:

Plotly Figure Object of Pie Chart

Return type:

Plotly Figure

Examples

>>> data.pieplot(val_column, name_column)
plot_colorpalettes

Displays color palette configuration guide.

plot_colors

Displays all plot colour names
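
Both are properties and are accessed without parentheses; for example:

>>> data.plot_colorpalettes
>>> data.plot_colors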

plot_dim_reduction(col: str, dim=2, algo='tsne', output_file='', **kwargs)

Reduce the dimensions of your data and then view similarly grouped data points (clusters).

For 2d and 3d plotting options, see the Plotly Express scatter and scatter_3d documentation.

Parameters:
  • col (str) – Column name of the labels/data points to highlight in the plot
  • dim (int {2, 3}) – Dimensions of the plot to show, either 2d or 3d, by default 2
  • algo (str {'tsne', 'lle', 'pca', 'tsvd'}, optional) – Algorithm to reduce the dimensions by, by default ‘tsne’
  • output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
  • kwargs – See plotting options
Returns:

Plotly Figure Object of Scatter Plot

Return type:

Plotly Figure

Examples

>>> data.plot_dim_reduction('cluster_labels', dim=3)
raincloud(x=None, y=None, output_file='', **params)

Combines the box plot, scatter plot and split violin plot into one data visualization. This is used to offer eyeballed statistical inference, assessment of data distributions (useful to check assumptions), and the raw data itself showing outliers and underlying patterns.

A raincloud is made of: 1) “Cloud”, a kernel density estimate, the half of a violin plot. 2) “Rain”, a strip plot below the cloud. 3) “Umbrella”, a box plot. 4) “Thunder”, a point plot connecting the means of the different categories (if pointplot is True).

https://seaborn.pydata.org/generated/seaborn.boxplot.html

https://seaborn.pydata.org/generated/seaborn.violinplot.html

https://seaborn.pydata.org/generated/seaborn.stripplot.html

Parameters:
  • x (str) – X axis data, referenced by column name; any data type.
  • y (str) – Y axis data, referenced by column name; measurable (numeric) data. By default, the target variable.
  • hue (Iterable, np.array, or dataframe column name if 'data' is specified) – Second categorical data. Use it to obtain different clouds and rainpoints
  • output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
  • orient (str) – vertical if “v” (default), horizontal if “h”
  • width_viol (float) – width of the cloud
  • width_box (float) – width of the boxplot
  • palette (list or dict) – Colours to use for the different levels of categorical variables
  • bw (str or float) – Either the name of a reference rule or the scale factor to use when computing the kernel bandwidth, by default “scott”
  • linewidth (float) – width of the lines
  • cut (float) – Distance, in units of bandwidth size, to extend the density past the extreme datapoints. Set to 0 to limit the violin range within the range of the observed data, by default 2
  • scale (str) – The method used to scale the width of each violin. If area, each violin will have the same area. If count, the width of the violins will be scaled by the number of observations in that bin. If width, each violin will have the same width. By default “area”
  • jitter (float, True/1) – Amount of jitter (only along the categorical axis) to apply. This can be useful when you have many points and they overlap, so that it is easier to see the distribution. You can specify the amount of jitter (half the width of the uniform random variable support), or just use True for a good default.
  • move (float) – adjust rain position to the x-axis (default value 0.)
  • offset (float) – adjust cloud position to the x-axis
  • color (matplotlib color) – Color for all of the elements, or seed for a gradient palette.
  • ax (matplotlib axes) – Axes object to draw the plot onto, otherwise uses the current Axes.
  • figsize ((int, int)) – size of the visualization, ex (12, 5)
  • pointplot (bool) – line that connects the means of all categories, by default False
  • dodge (bool) – When hue nesting is used, whether elements should be shifted along the categorical axis.
  Source: https://micahallen.org/2018/03/15/introducing-raincloud-plots/

Examples

>>> data.raincloud('col1') # Will plot col1 values on the x axis and your target variable values on the y axis
>>> data.raincloud('col1', 'col2') # Will plot col1 on the x and col2 on the y axis
>>> data.raincloud('col1', 'col2', output_file='raincloud.png')
scatterplot(x=None, y=None, z=None, color=None, title='Scatter Plot', output_file='', **scatterplot_kwargs)

Plots a scatterplot for the given x and y columns provided using Plotly Express.

For a list of possible scatterplot_kwargs for 2d and 3d data, please see the Plotly Express scatter and scatter_3d documentation.

Parameters:
  • x (str) – X column name
  • y (str) – Y column name
  • z (str) – Z column name,
  • color (str, optional) – Category to group your data, by default None
  • title (str, optional) – Title of the plot, by default ‘Scatter Plot’
  • output_file (str, optional) – Output html file name for image
  • symbol (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign symbols to marks.
  • size (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign mark sizes.
  • hover_name (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in bold in the hover tooltip.
  • hover_data (list of str or int, or Series or array-like, or dict)) – Either a list of names of columns in data_frame, or pandas Series, or array_like objects or a dict with column names as keys, with values True (for default formatting) False (in order to remove this column from hover information), or a formatting string, for example ‘:.3f’ or ‘|%a’ or list-like data to appear in the hover tooltip or tuples with a bool or formatting string as first element, and list-like data to appear in hover as second element Values from these columns appear as extra data in the hover tooltip.
  • custom_data (list of str or int, or Series or array-like) – Either names of columns in data_frame, or pandas Series, or array_like objects Values from these columns are extra data, to be used in widgets or Dash callbacks for example. This data is not user-visible but is included in events emitted by the figure (lasso selection etc.)
  • text (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in the figure as text labels.
  • facet_row (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to facetted subplots in the vertical direction.
  • facet_col (str or int or Series or array-like)) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to facetted subplots in the horizontal direction.
  • facet_col_wrap (int) – Maximum number of facet columns. Wraps the column variable at this width, so that the column facets span multiple rows. Ignored if 0, and forced to 0 if facet_row or a marginal is set.
  • error_x (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars. If error_x_minus is None, error bars will be symmetrical, otherwise error_x is used for the positive direction only.
  • error_x_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars in the negative direction. Ignored if error_x is None.
  • error_y (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars. If error_y_minus is None, error bars will be symmetrical, otherwise error_y is used for the positive direction only.
  • error_y_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars in the negative direction. Ignored if error_y is None.
  • animation_frame (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to animation frames.
  • animation_group (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to provide object-constancy across animation frames: rows with matching `animation_group`s will be treated as if they describe the same object in each frame.
  • labels (dict with str keys and str values (default {})) – By default, column names are used in the figure for axis titles, legend entries and hovers. This parameter allows this to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired label to be displayed.
  • color_discrete_sequence (list of str) – Strings should define valid CSS-colors. When color is set and the values in the corresponding column are not numeric, values in that column are assigned colors by cycling through color_discrete_sequence in the order described in category_orders, unless the value of color is a key in color_discrete_map. Various useful color sequences are available in the plotly.express.colors submodules, specifically plotly.express.colors.qualitative.
  • color_discrete_map (dict with str keys and str values (default {})) – String values should define valid CSS-colors Used to override color_discrete_sequence to assign a specific colors to marks corresponding with specific values. Keys in color_discrete_map should be values in the column denoted by color.
  • color_continuous_scale (list of str) – Strings should define valid CSS-colors This list is used to build a continuous color scale when the column denoted by color contains numeric data. Various useful color scales are available in the plotly.express.colors submodules, specifically plotly.express.colors.sequential, plotly.express.colors.diverging and plotly.express.colors.cyclical.
  • range_color (list of two numbers) – If provided, overrides auto-scaling on the continuous color scale.
  • color_continuous_midpoint (number (default None)) – If set, computes the bounds of the continuous color scale to have the desired midpoint. Setting this value is recommended when using plotly.express.colors.diverging color scales as the inputs to color_continuous_scale.
  • opacity (float) – Value between 0 and 1. Sets the opacity for markers.
  • size_max (int (default 20)) – Set the maximum mark size when using size.
  • marginal_x (str) – One of ‘rug’, ‘box’, ‘violin’, or ‘histogram’. If set, a horizontal subplot is drawn above the main plot, visualizing the x-distribution.
  • marginal_y (str) – One of ‘rug’, ‘box’, ‘violin’, or ‘histogram’. If set, a vertical subplot is drawn to the right of the main plot, visualizing the y-distribution.
  • trendline (str) – One of ‘ols’ or ‘lowess’. If ‘ols’, an Ordinary Least Squares regression line will be drawn for each discrete-color/symbol group. If ‘lowess’, a Locally Weighted Scatterplot Smoothing line will be drawn for each discrete-color/symbol group.
  • trendline_color_override (str)) – Valid CSS color. If provided, and if trendline is set, all trendlines will be drawn in this color.
  • log_x (boolean (default False)) – If True, the x-axis is log-scaled in cartesian coordinates.
  • log_y (boolean (default False)) – If True, the y-axis is log-scaled in cartesian coordinates.
  • range_x (list of two numbers) – If provided, overrides auto-scaling on the x-axis in cartesian coordinates.
  • range_y (list of two numbers) – If provided, overrides auto-scaling on the y-axis in cartesian coordinates.
  • width (int (default None)) – The figure width in pixels.
  • height (int (default None)) – The figure height in pixels.
Returns:

Plotly Figure Object of Scatter Plot

Return type:

Plotly Figure

Examples

>>> data.scatterplot(x='x', y='y') #2d
>>> data.scatterplot(x='x', y='y', z='z') #3d
>>> data.scatterplot(x='x', y='y', z='z', output_file='scatt')
violinplot(x=None, y=None, color=None, title='', output_file='', **kwargs)

Plots a violin plot for the given x and y columns.

For more info and kwargs for violin plots, see https://plot.ly/python-api-reference/generated/plotly.express.violin.html#plotly.express.violin and https://plot.ly/python/violin/

Parameters:
  • x (str) – X axis column
  • y (str) – y axis column
  • color (str, optional) – Column name to add a dimension by color.
  • orient (str, optional) – Orientation of graph, ‘h’ for horizontal ‘v’ for vertical, by default ‘v’,
  • points (str or bool {'outliers', 'suspectedoutliers', 'all', False}) – If ‘outliers’, only the sample points lying outside the whiskers are shown. If ‘suspectedoutliers’, all outlier points are shown and those less than 4*Q1-3*Q3 or greater than 4*Q3-3*Q1 are highlighted with the marker’s ‘outliercolor’. If ‘all’, all sample points are shown. If False, no sample points are shown and the whiskers extend to the full range of the sample.
  • violinmode (str {'group', 'overlay'}) – In ‘overlay’ mode, violins are drawn on top of one another. In ‘group’ mode, violins are placed beside each other.
  • box (bool, optional) – If True, boxes are drawn inside the violins.
  • title (str, optional) – Title of the plot, by default “”.
  • output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
Returns:

Plotly Figure Object of Violin Plot

Return type:

Plotly Figure

Examples

>>> data.violinplot(y='y', color='z', box=True)
>>> data.violinplot(x='x', y='y', color='z', points='all')
>>> data.violinplot(x='x', y='y', violinmode='overlay', output_file='pair.png')
class aethos.cleaning.Clean

Bases: object

drop_column_missing_threshold(threshold: float)

Remove columns from the dataframe that have greater than or equal to the threshold value of missing values. Example: Remove columns where >= 50% of the data is missing.

Parameters:threshold (float) – Value between 0 and 1 that describes what percentage of a column can be missing values.
Returns:Returns a deep copy of the Data object.
Return type:Data

Examples

>>> data.drop_column_missing_threshold(0.5)
drop_constant_columns()

Remove columns from the data that only have one unique value.

Returns:Returns a deep copy of the Data object.
Return type:Data

Examples

>>> data.drop_constant_columns()
drop_duplicate_columns()

Remove columns from the data that are exact duplicates of each other and leave only 1.

Returns:Returns a deep copy of the Data object.
Return type:Data

Examples

>>> data.drop_duplicate_columns()
drop_duplicate_rows(*list_args, list_of_cols=[])

Remove rows from the data that are exact duplicates of each other and leave only 1. This can be used to reduce processing time or improve performance for algorithms where duplicates have no effect on the outcome (i.e. DBSCAN).

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.drop_duplicate_rows('col1', 'col2') # Only look at columns 1 and 2
>>> data.drop_duplicate_rows(['col1', 'col2'])
>>> data.drop_duplicate_rows()
drop_rows_missing_threshold(threshold: float)

Remove rows from the dataframe that have greater than or equal to the threshold value of missing values. Example: Remove rows where >= 50% of the data is missing.

Parameters:threshold (float) – Value between 0 and 1 that describes what percentage of a row can be missing values.
Returns:Returns a deep copy of the Data object.
Return type:Data

Examples

>>> data.drop_rows_missing_threshold(0.5)
drop_unique_columns()

Remove columns from the data in which every value is unique.

Returns:Returns a deep copy of the Data object.
Return type:Data

Examples

>>> data.drop_unique_columns()
replace_missing_backfill(*list_args, list_of_cols=[], **extra_kwargs)

Replaces missing values in a column with the next known data point.

This is useful when dealing with timeseries data and you want to replace data in the past with data from the future.

For more info view the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_backfill('col1', 'col2')
>>> data.replace_missing_backfill(['col1', 'col2'])
replace_missing_constant(*list_args, list_of_cols=[], constant=0, col_mapping=None)

Replaces missing values in every numeric column with a constant.

If no columns are supplied, missing values will be replaced with the constant in every numeric column.

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • constant (int or float, optional) – Numeric value to replace all missing values with , by default 0
  • col_mapping (dict, optional) – Dictionary mapping {‘ColumnName’: constant}, by default None
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_constant(col_mapping={'a': 1, 'b': 2, 'c': 3})
>>> data.replace_missing_constant('col1', 'col2', constant=2)
>>> data.replace_missing_constant(['col1', 'col2'], constant=3)
replace_missing_forwardfill(*list_args, list_of_cols=[], **extra_kwargs)

Replaces missing values in a column with the last known data point.

This is useful when dealing with timeseries data and you want to replace future missing data with the past.

For more info view the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_forwardfill('col1', 'col2')
>>> data.replace_missing_forwardfill(['col1', 'col2'])
replace_missing_indicator(*list_args, list_of_cols=[], missing_indicator=1, valid_indicator=0, keep_col=True)

Adds a new column describing whether data is missing for each record in a column.

This is useful if the missing data has meaning, i.e. it is not missing at random.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • missing_indicator (int, optional) – Value to indicate missing data, by default 1
  • valid_indicator (int, optional) – Value to indicate non missing data, by default 0
  • keep_col (bool, optional) – True to keep the column, False to replace it, by default True
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_indicator('col1', 'col2')
>>> data.replace_missing_indicator(['col1', 'col2'])
>>> data.replace_missing_indicator(['col1', 'col2'], missing_indicator='missing', valid_indicator='not missing', keep_col=False)
replace_missing_interpolate(*list_args, list_of_cols=[], method='linear', **inter_kwargs)

Replaces missing values with an interpolation method and possible extrapolation.

The possible interpolation methods are:

  • ‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
  • ‘time’: Works on daily and higher resolution data to interpolate given length of interval.
  • ‘index’, ‘values’: use the actual numerical values of the index.
  • ‘pad’: Fill in NaNs using existing values.
  • ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d.
    • These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method=’polynomial’, order=5).
  • ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’: Wrappers around the SciPy interpolation methods of similar names.
  • ‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives which replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18.

For more information see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html or https://docs.scipy.org/doc/scipy/reference/interpolate.html.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • method (str, optional) – Interpolation method, by default ‘linear’
  • limit (int, optional) – Maximum number of consecutive NaNs to fill. Must be greater than 0.
  • limit_area ({None, ‘inside’, ‘outside’}, default None) –

    If limit is specified, consecutive NaNs will be filled with this restriction.

    • None: No fill restriction.
    • ‘inside’: Only fill NaNs surrounded by valid values (interpolate).
    • ‘outside’: Only fill NaNs outside valid values (extrapolate).
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_interpolate('col1', 'col2')
>>> data.replace_missing_interpolate(['col1', 'col2'])
>>> data.replace_missing_interpolate('col1', 'col2', method='pad', limit=3)
replace_missing_knn(k=5, **knn_kwargs)

Replaces missing data with data from similar records based off a distance metric.

For more info see: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer

Parameters:
  • missing_values (number, string, np.nan or None, default=`np.nan`) – The placeholder for the missing values. All occurrences of missing_values will be imputed.
  • k (int, default=5) – Number of neighboring samples to use for imputation.
  • weights ({‘uniform’, ‘distance’} or callable, default=’uniform’) –

    Weight function used in prediction. Possible values:

    ‘uniform’ : uniform weights. All points in each neighborhood are weighted equally.

    ‘distance’ : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.

    callable : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.

  • metric ({‘nan_euclidean’} or callable, default=’nan_euclidean’) –

    Distance metric for searching neighbors. Possible values:

    ‘nan_euclidean’

    callable : a user-defined function which conforms to the definition of _pairwise_callable(X, Y, metric, **kwds). The function accepts two arrays, X and Y, and a missing_values keyword in kwds and returns a scalar distance value.

  • add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_knn(k=8)
replace_missing_mean(*list_args, list_of_cols=[])

Replaces missing values in every numeric column with the mean of that column.

If no columns are supplied, missing values will be replaced with the mean in every numeric column.

Mean: Average value of the column. Affected by outliers.

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to
  • list_of_cols (list, optional) – Specific columns to apply this technique to, by default []
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_mean('col1', 'col2')
>>> data.replace_missing_mean(['col1', 'col2'])
replace_missing_median(*list_args, list_of_cols=[])

Replaces missing values in every numeric column with the median of that column.

If no columns are supplied, missing values will be replaced with the median in every numeric column.

Median: Middle value of a list of numbers. Equal to the mean if the data follows a normal distribution. Not affected much by anomalies.

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – Specific columns to apply this technique to., by default []
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_median('col1', 'col2')
>>> data.replace_missing_median(['col1', 'col2'])
replace_missing_mostcommon(*list_args, list_of_cols=[])

Replaces missing values in every numeric column with the most common value of that column.

Mode: Most common value.

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_mostcommon('col1', 'col2')
>>> data.replace_missing_mostcommon(['col1', 'col2'])
replace_missing_new_category(*list_args, list_of_cols=[], new_category=None, col_mapping=None)

Replaces missing values in a categorical column with its own category. The category can be automatically chosen from the default set.

For numeric categorical columns the default values are: -1, -999, -9999. For string categorical columns the default values are: “Other”, “Unknown”, “MissingDataCategory”.

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • new_category (str, int, or float, optional) – Category to replace missing values with, by default None
  • col_mapping (dict, optional) – Dictionary mapping {‘ColumnName’: constant}, by default None
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_new_category(col_mapping={'col1': "Green", 'col2': "Canada", 'col3': "December"})
>>> data.replace_missing_new_category('col1', 'col2', 'col3', new_category='Blue')
>>> data.replace_missing_new_category(['col1', 'col2', 'col3'], new_category='Blue')
replace_missing_random_discrete(*list_args, list_of_cols=[])

Replace missing values with a random value drawn from the distribution (number of occurrences) of the data.

For example, if your data was [5, 5, NaN, 1, 2], there would be a 50% chance that the NaN would be replaced with a 5, a 25% chance for 1 and a 25% chance for 2.

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_random_discrete('col1', 'col2')
>>> data.replace_missing_random_discrete(['col1', 'col2'])
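
A standalone sketch of the sampling idea above (an illustration only, not the aethos implementation):

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [5, 5, np.nan, 1, 2]})

# Empirical distribution of the existing values: {5: 0.5, 1: 0.25, 2: 0.25}.
probs = df["col1"].value_counts(normalize=True)

# Draw one replacement per missing cell, weighted by those probabilities.
missing = df["col1"].isna()
df.loc[missing, "col1"] = np.random.choice(probs.index, size=missing.sum(), p=probs.values)
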
replace_missing_remove_row(*list_args, list_of_cols=[])

Removes rows that have a missing value in any of the specified columns.

If a list of columns is provided, use the list, otherwise use the arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.replace_missing_remove_row('col1', 'col2')
>>> data.replace_missing_remove_row(['col1', 'col2'])
class aethos.preprocessing.Preprocess

Bases: object

clean_text(*list_args, list_of_cols=[], lower=True, punctuation=True, stopwords=True, stemmer=True, numbers=True, new_col_name='_clean')

Function that takes text and does the following:

  • Casts it to lowercase
  • Removes punctuation
  • Removes stopwords
  • Stems the text
  • Removes any numerical text
Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • lower (bool, optional) – True to cast all text to lowercase, by default True
  • punctuation (bool, optional) – True to remove punctuation, by default True
  • stopwords (bool, optional) – True to remove stop words, by default True
  • stemmer (bool, optional) – True to stem the data, by default True
  • numbers (bool, optional) – True to remove numerical data, by default True
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_clean
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.clean_text('col1')
>>> data.clean_text(['col1', 'col2'], lower=False)
>>> data.clean_text(lower=False, stopwords=False, stemmer=False)
normalize_log(*list_args, list_of_cols=[], base=1)

Scales data logarithmically.

Options are 1 for natural log, 2 for base2, 10 for base10.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • base (int, optional) – Base to logarithmically scale by, by default 1
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.normalize_log('col1')
>>> data.normalize_log(['col1', 'col2'], base=10)
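
The transformation itself is an element-wise logarithm; a minimal NumPy sketch of the three supported bases:

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1, 10, 100, 1000]})

natural = np.log(df["col1"])    # base=1 -> natural log
base_2 = np.log2(df["col1"])    # base=2
base_10 = np.log10(df["col1"])  # base=10
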
normalize_numeric(*list_args, list_of_cols=[], **normalize_params)

Normalizes all numeric values between two values to bring features into the same domain.

If list_of_cols is not provided, the strategy will be applied to all numeric columns.

If a list of columns is provided use the list, otherwise use arguments.

For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • feature_range (tuple(int or float, int or float), optional) – Min and max range to normalize values to, by default (0, 1)
  • normalize_params (dict, optional) – Parameters to pass into the MinMaxScaler() constructor from Scikit-Learn
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.normalize_numeric('col1')
>>> data.normalize_numeric(['col1', 'col2'])
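
The scaling is delegated to Scikit-Learn's MinMaxScaler; a minimal standalone sketch of the same transformation:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"col1": [1.0, 5.0, 10.0], "col2": [100.0, 50.0, 0.0]})

# Rescale each column into the requested range, here the default (0, 1).
scaler = MinMaxScaler(feature_range=(0, 1))
df[["col1", "col2"]] = scaler.fit_transform(df[["col1", "col2"]])
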
normalize_quantile_range(*list_args, list_of_cols=[], **robust_params)

Scale features using statistics that are robust to outliers.

This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.

If list_of_cols is not provided, the strategy will be applied to all numeric columns.

If a list of columns is provided use the list, otherwise use arguments.

For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • with_centering (boolean, True by default) – If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
  • with_scaling (boolean, True by default) – If True, scale the data to interquartile range.
  • quantile_range (tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0) – Default: (25.0, 75.0) = (1st quantile, 3rd quantile) = IQR Quantile range used to calculate scale_.
  • robust_params (dict, optional) – Parameters to pass into the RobustScaler() constructor from Scikit-Learn
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.normalize_quantile_range('col1')
>>> data.normalize_quantile_range(['col1', 'col2'])
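
A minimal standalone sketch of the equivalent scaling with Scikit-Learn's RobustScaler:

import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0, 4.0, 100.0]})  # 100.0 is an outlier

# Remove the median and scale by the IQR so the outlier barely influences the result.
scaler = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0))
df[["col1"]] = scaler.fit_transform(df[["col1"]])
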
remove_numbers(*list_args, list_of_cols=[], new_col_name='_rem_num')

Removes numbers from text in a column.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_num
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.remove_numbers('col1', new_col_name="text_wo_num")
remove_punctuation(*list_args, list_of_cols=[], regexp='', exceptions=[], new_col_name='_rem_punct')

Removes punctuation from every string entry.

Defaults to removing all punctuation; if a regex defining which text to keep is provided, that will be used instead.

An example regex would be:

(\w+\.|\w+)[^,] – Include all words and words with periods after them, but don’t include commas. (\w+\.)|(\w+), would also achieve the same result.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • regexp (str, optional) – Regex expression used to define what to include.
  • exceptions (list, optional) – List of punctuation to include in the text, by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_punct
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.remove_punctuation('col1')
>>> data.remove_punctuation(['col1', 'col2'])
>>> data.remove_punctuation('col1', regexp=r'(\w+\.)|(\w+)') # Include all words and words with periods after.
remove_stopwords_nltk(*list_args, list_of_cols=[], custom_stopwords=[], new_col_name='_rem_stop')

Removes stopwords following the nltk English stopwords list.

A list of custom words can be provided as well, usually for domain specific words.

Stop words are generally the most common words in a language.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • custom_stopwords (list, optional) – Custom list of words to also drop with the stop words, must be LOWERCASE, by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_stop
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.remove_stopwords_nltk('col1')
>>> data.remove_stopwords_nltk(['col1', 'col2'])
split_sentences(*list_args, list_of_cols=[], new_col_name='_sentences')

Splits text data into sentences and saves it into another column for analysis.

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_sentences
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.split_sentences('col1')
>>> data.split_sentences(['col1', 'col2'])
split_words_nltk(*list_args, list_of_cols=[], regexp='', new_col_name='_tokenized')

Splits text into its words using nltk punkt tokenizer by default.

The default splits on spaces and punctuation, but if a regex expression is provided, it will be used instead.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • regexp (str, optional) – Regex expression used to define what a word is.
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_tokenized
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.split_words_nltk('col1')
>>> data.split_words_nltk(['col1', 'col2'])
stem_nltk(*list_args, list_of_cols=[], stemmer='porter', new_col_name='_stemmed')

Transforms words to their word stem, base or root form. For example:

  • dogs –> dog
  • churches –> church
  • abaci –> abacus
Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • stemmer (str, optional) –

    Type of NLTK stemmer to use, by default porter

    Current stemming implementations:
    • porter
    • snowball

    For more information please refer to the NLTK stemming api https://www.nltk.org/api/nltk.stem.html

  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_stemmed
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.stem_nltk('col1')
>>> data.stem_nltk(['col1', 'col2'], stemmer='snowball')
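
A minimal NLTK sketch of the stemming step (illustration only):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

words = ["dogs", "running", "jumped"]
print([porter.stem(w) for w in words])    # e.g. ['dog', 'run', 'jump']
print([snowball.stem(w) for w in words])
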
class aethos.feature_engineering.Feature

Bases: object

apply(func, output_col: str)

Calls the pandas apply function. The function will be applied to your dataset, or to both your training and testing datasets.

Parameters:
  • func (Function pointer) – Function describing the transformation for the new column
  • output_col (str) – New column name
Returns:

Returns a deep copy of the Feature object.

Return type:

Feature

Examples

>>>     col1  col2  col3
    0     1     0     1
    1     0     2     0
    2     1     0     1
>>> data.apply(lambda x: x['col1'] > 0, 'col4')
>>>     col1  col2  col3  col4
    0     1     0     1     1
    1     0     2     0     0
    2     1     0     1     1
bag_of_words(*list_args, list_of_cols=[], keep_col=True, **bow_kwargs)

Creates a matrix of how many times a word appears in a document.

The premise is that the more times a word appears the more the word represents that document.

For more information see: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • keep_col (bool, optional) – True if you want to keep the column(s) or False if you want to drop the column(s)
  • encoding (str, default=’utf-8’) – If bytes or files are given to analyze, this encoding is used to decode.
  • decode_error ({‘strict’, ‘ignore’, ‘replace’} (default=’strict’)) – Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
  • strip_accents ({‘ascii’, ‘unicode’, None} (default=None)) –

    Remove accents and perform other character normalization during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have an direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.

    Both ‘ascii’ and ‘unicode’ use NFKD normalization from unicodedata.normalize.

  • lowercase (bool (default=True)) – Convert all characters to lowercase before tokenizing.
  • preprocessor (callable or None (default=None)) – Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable.
  • tokenizer (callable or None (default=None)) – Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == ‘word’.
  • analyzer (str, {‘word’, ‘char’, ‘char_wb’} or callable) –

    Whether the feature should be made of word or character n-grams Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

    If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.

  • stop_words (str {‘english’}, list, or None (default=None)) –

    If a string, it is passed to _check_stop_list and the appropriate stop list is returned. ‘english’ is currently the only supported string value. There are several known issues with ‘english’ and you should consider an alternative (see Using stop words).

    If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == ‘word’.

    If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

  • token_pattern (str) – Regular expression denoting what constitutes a “token”, only used if analyzer == ‘word’. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
  • ngram_range (tuple (min_n, max_n), default=(1, 1)) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.
  • max_df (float in range [0.0, 1.0] or int (default=1.0)) – When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
  • min_df (float in range [0.0, 1.0] or int (default=1)) – When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
  • max_features (int or None (default=None)) –

    If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

    This parameter is ignored if vocabulary is not None.

  • vocabulary (Mapping or iterable, optional (default=None)) – Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.
  • binary (bool (default=False)) – If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs).
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.bag_of_words('col1', 'col2', 'col3')
>>> data.bag_of_words('col1', 'col2', 'col3', binary=True)
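
A minimal standalone sketch of the matrix this produces, using Scikit-Learn's CountVectorizer directly:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog barked", "the dog and the cat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # sparse document-term matrix

print(sorted(vectorizer.vocabulary_))     # the learned vocabulary
print(counts.toarray())                   # per-document word counts
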
chi2_feature_selection(k: int, verbose=False)

Uses Chi2 to choose the best K features.

The Chi2 null hypothesis is that 2 variables are independent.

Chi-square test feature selection “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification.

Parameters:
  • k (int or "all") – Number of features to keep.
  • verbose (bool) – True to print p-values for each feature, by default False
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.chi2_feature_selection(k=10)
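
A minimal Scikit-Learn sketch of the same selection (assumes non-negative features and a classification target):

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

x = pd.DataFrame({"f1": [0, 1, 2, 3], "f2": [5, 0, 4, 0], "f3": [3, 0, 2, 1]})
y = [0, 1, 0, 1]

# Keep the 2 features with the largest chi2 statistic against the target.
selector = SelectKBest(chi2, k=2)
x_new = selector.fit_transform(x, y)
print(selector.get_support())   # boolean mask of the kept columns
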
drop_correlated_features(threshold=0.95)

Drop features that have a correlation coefficient greater than the specified threshold with other features.

Parameters:threshold (float, optional) – Correlation coefficient threshold, by default 0.95
Returns:Returns a deep copy of the Data object.
Return type:Data

Examples

>>> data.drop_correlated_features(threshold=0.9)
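
A standalone pandas sketch of the general approach (an assumed implementation, for illustration): inspect the upper triangle of the absolute correlation matrix and drop any column correlated above the threshold with an earlier one.

import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
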
nounphrases_nltk(*list_args, list_of_cols=[], new_col_name='_phrases')

Extract noun phrases from text using the Textblob package, which uses the NLTK NLP engine.

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_phrases
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.nounphrases_nltk('col1', 'col2', 'col3')
nounphrases_spacy(*list_args, list_of_cols=[], new_col_name='_phrases')

Extract noun phrases from text using the spaCy NLP engine.

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_phrases
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.nounphrases_spacy('col1', 'col2', 'col3')
onehot_encode(*list_args, list_of_cols=[], keep_col=True, **onehot_kwargs)

Converts categorical columns into binary columns of ones and zeros (a one-hot encoded matrix).

For more info see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • keep_col (bool, optional) – True to keep the column being transformed, False to drop it, by default True
  • categories (‘auto’ or a list of array-like, default=’auto’) –

    Categories (unique values) per feature:

    ‘auto’ : Determine categories automatically from the training data.

    list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

    The used categories can be found in the categories_ attribute.

  • drop (‘first’ or a array-like of shape (n_features,), default=None) –

    Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

    None : retain all features (the default).

    ‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

    array : drop[i] is the category in feature X[:, i] that should be dropped.

  • sparse (bool, default=True) – Will return a sparse matrix if set to True, else will return an array.
  • dtype (number type, default=np.float) – Desired dtype of output.
  • handle_unknown ({‘error’, ‘ignore’}, default='ignore') – Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.onehot_encode('col1', 'col2', 'col3')
>>> data.onehot_encode('col1', 'col2', 'col3', drop='first')
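
A minimal standalone sketch of the binary expansion with Scikit-Learn's OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform([["red"], ["green"], ["red"]])

print(encoder.categories_)   # [array(['green', 'red'], dtype=object)]
print(encoded.toarray())     # [[0., 1.], [1., 0.], [0., 1.]]
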
ordinal_encode_labels(col: str, ordered_cat=[])

Encode categorical values with value between 0 and n_classes-1.

Running this function on the target column will automatically store the mapping from the encoded number back to the original value.

Note: this will not work if your test data has labels that your train data does not.

Parameters:
  • col (str) – Column in the data to ordinally encode.
  • ordered_cat (list, optional) – A list of ordered categories for the Ordinal encoder. Should be sorted.
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.ordinal_encode_labels('col1')
>>> data.ordinal_encode_labels('col1', ordered_cat=["Low", "Medium", "High"])
pca(n_components=10, **pca_kwargs)

Reduces the dimensionality of the data using Principal Component Analysis.

Use PCA when the data is dense.

This can be used to reduce complexity as well as speed up computation.

For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

This function exists in feature-extraction/util.py

Parameters:
  • n_components (int, float, None or string, by default 10) –

    Number of components to keep.

If n_components is not set, all components are kept. If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension; use of n_components == ‘mle’ will interpret svd_solver == ‘auto’ as svd_solver == ‘full’. If 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components. If svd_solver == ‘arpack’, the number of components must be strictly less than the minimum of n_features and n_samples.

  • whiten (bool, optional (default False)) – When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances. Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometime improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
  • svd_solver (string {‘auto’, ‘full’, ‘arpack’, ‘randomized’}) –
    auto :
    the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
    full :
    run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing
    arpack :
    run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)
    randomized :
    run randomized SVD by the method of Halko et al.
  • tol (float >= 0, optional (default .0)) – Tolerance for singular values computed by svd_solver == ‘arpack’.
  • iterated_power (int >= 0, or ‘auto’, (default ‘auto’)) – Number of iterations for the power method computed by svd_solver == ‘randomized’.
  • random_state (int, RandomState instance or None, optional (default None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.pca(n_components=2)
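
A minimal Scikit-Learn sketch of the underlying reduction:

import numpy as np
from sklearn.decomposition import PCA

x = np.random.rand(100, 10)            # 100 samples, 10 dense features

pca = PCA(n_components=2)
reduced = pca.fit_transform(x)         # shape (100, 2)
print(pca.explained_variance_ratio_)   # variance captured by each kept component
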
polynomial_features(*list_args, list_of_cols=[], **poly_kwargs)

Generate polynomial and interaction features.

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.

For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • degree (int) – Degree of the polynomial features, by default 2
  • interaction_only (boolean, optional) – If True, only interaction features are produced: features that are products of at most degree distinct input features (so not x[1] ** 2, x[0] * x[2] ** 3, etc.), by default False
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.polynomial_features('col1', 'col2', 'col3')
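
A minimal Scikit-Learn sketch showing the degree-2 expansion described above:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2, 3]])            # one sample of the form [a, b]

poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(x))      # [[1. 2. 3. 4. 6. 9.]] i.e. [1, a, b, a^2, ab, b^2]
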
postag_nltk(*list_args, list_of_cols=[], new_col_name='_postagged')

Tag documents with their respective “Part of Speech” tag with the Textblob package which utilizes the NLTK NLP engine and Penn Treebank tag set. These tags classify a word as a noun, verb, adjective, etc. A full list and their meaning can be found here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_postagged
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.postag_nltk('col1', 'col2', 'col3')
postag_spacy(*list_args, list_of_cols=[], new_col_name='_postagged')

Tag documents with their respective “Part of Speech” tag with the Spacy NLP engine and the Universal Dependencies scheme. These tags classify a word as a noun, verb, adjective, etc. A full list and their meaning can be found here: https://spacy.io/api/annotation#pos-tagging

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_postagged
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.postag_spacy('col1', 'col2', 'col3')
postag_spacy_detailed(*list_args, list_of_cols=[], new_col_name='_postagged')

Tag documents with their respective “Part of Speech” tag with the Spacy NLP engine and the Penn Treebank PoS tags. These tags classify a word as a noun, verb, adjective, etc. A full list and their meaning can be found here: https://spacy.io/api/annotation#pos-tagging

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_postagged
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.postag_spacy_detailed('col1', 'col2', 'col3')
text_hash(*list_args, list_of_cols=[], keep_col=True, **hash_kwargs)

Creates a matrix of how many times a word appears in a document. The counts can optionally be normalized as token frequencies if norm=’l1’ or projected onto the euclidean unit sphere if norm=’l2’.

The premise is that the more times a word appears the more the word represents that document.

This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.

This strategy has several advantages:

  • It is very low memory and scalable to large datasets, as there is no need to store a vocabulary dictionary in memory.
  • It is fast to pickle and un-pickle, as it holds no state besides the constructor parameters.
  • It can be used in a streaming (partial fit) or parallel pipeline, as there is no state computed during fit.

For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • keep_col (bool, optional) – True if you want to keep the column(s) or False if you want to drop the column(s)
  • n_features (integer, default=(2 ** 20)) – The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.
  • hash_kwargs (dict, optional) – Parameters you would pass into the HashingVectorizer constructor, by default {}
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.text_hash('col1', 'col2', 'col3')
>>> data.text_hash('col1', 'col2', 'col3', n_features=50)
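
A minimal HashingVectorizer sketch; note there is no stored vocabulary because the token-to-column mapping is a hash:

from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the dog barked", "the dog and the cat"]

vectorizer = HashingVectorizer(n_features=16)   # tiny n_features, purely for illustration
matrix = vectorizer.fit_transform(docs)         # sparse (n_docs, n_features) matrix
print(matrix.shape)
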
tfidf(*list_args, list_of_cols=[], keep_col=True, **tfidf_kwargs)

Creates a matrix of the tf-idf score for every word in the corpus as it pertains to each document.

The higher the score the more important a word is to a document, the lower the score (relative to the other scores) the less important a word is to a document.

For more information see: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

If a list of columns is provided use the list, otherwise use arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
  • keep_col (bool, optional) – True if you want to keep the column(s) or False if you want to drop the column(s)
  • encoding (str, default=’utf-8’) – If bytes or files are given to analyze, this encoding is used to decode.
  • decode_error ({‘strict’, ‘ignore’, ‘replace’} (default=’strict’)) – Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
  • strip_accents ({‘ascii’, ‘unicode’, None} (default=None)) –

    Remove accents and perform other character normalization during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have an direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.

    Both ‘ascii’ and ‘unicode’ use NFKD normalization from unicodedata.normalize.

  • lowercase (bool (default=True)) – Convert all characters to lowercase before tokenizing.
  • preprocessor (callable or None (default=None)) – Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable.
  • tokenizer (callable or None (default=None)) – Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == ‘word’.
  • analyzer (str, {‘word’, ‘char’, ‘char_wb’} or callable) –

    Whether the feature should be made of word or character n-grams Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

    If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.

  • stop_words (str {‘english’}, list, or None (default=None)) –

    If a string, it is passed to _check_stop_list and the appropriate stop list is returned. ‘english’ is currently the only supported string value. There are several known issues with ‘english’ and you should consider an alternative (see Using stop words).

    If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == ‘word’.

    If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

  • token_pattern (str) – Regular expression denoting what constitutes a “token”, only used if analyzer == ‘word’. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
  • ngram_range (tuple (min_n, max_n), default=(1, 1)) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.
  • max_df (float in range [0.0, 1.0] or int (default=1.0)) – When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
  • min_df (float in range [0.0, 1.0] or int (default=1)) – When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
  • max_features (int or None (default=None)) –

    If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

    This parameter is ignored if vocabulary is not None.

  • vocabulary (Mapping or iterable, optional (default=None)) – Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.
  • binary (bool (default=False)) – If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs).
  • dtype (type, optional (default=float64)) – Type of the matrix returned by fit_transform() or transform().
  • norm (‘l1’, ‘l2’ or None, optional (default=’l2’)) – Each output row will have unit norm, either: * ‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied. * ‘l1’: Sum of absolute values of vector elements is 1.
  • use_idf (bool (default=True)) – Enable inverse-document-frequency reweighting.
  • smooth_idf (bool (default=True)) – Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
  • sublinear_tf (bool (default=False)) – Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.tfidf('col1', 'col2', 'col3')
>>> data.tfidf('col1', 'col2', 'col3', lowercase=False, smooth_idf=False)
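
A minimal TfidfVectorizer sketch of the scoring described above:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the dog barked", "the dog and the cat"]

vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(docs)   # sparse matrix of tf-idf scores

print(sorted(vectorizer.vocabulary_))
print(scores.toarray())
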
truncated_svd(n_components=50, **svd_kwargs)

Reduces the dimensionality of the data using Truncated SVD.

In particular, truncated SVD works on term count/tf-idf matrices. In that context, it is known as latent semantic analysis (LSA).

Use Truncated SVD when the data is sparse.

This can be used to reduce complexity as well as speed up computation.

For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

This function exists in feature-extraction/util.py

Parameters:
  • n_components (int, optional) – Desired dimensionality of output data. Must be strictly less than the number of features. For LSA, a value of 100 is recommended. By default 50.
  • algorithm (string, default = “randomized”) – SVD solver to use. Either “arpack” for the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds), or “randomized” for the randomized algorithm due to Halko (2009).
  • n_iter (int, optional (default 5)) – Number of iterations for randomized SVD solver. Not used by ARPACK. The default is larger than the default in ~sklearn.utils.extmath.randomized_svd to handle sparse matrices that may have large slowly decaying spectrum.
  • tol (float, optional) – Tolerance for ARPACK. 0 means machine precision. Ignored by randomized SVD solver.
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.truncated_svd(n_components=2)
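
A minimal Scikit-Learn sketch of an LSA-style reduction on a sparse tf-idf matrix:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the dog barked", "the dog and the cat", "cats and dogs play"]

tfidf = TfidfVectorizer().fit_transform(docs)   # sparse term matrix

svd = TruncatedSVD(n_components=2)
reduced = svd.fit_transform(tfidf)              # dense array of shape (3, 2)
print(svd.explained_variance_ratio_)
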
class aethos.stats.stats.Stats

Bases: object

anova(dep_var: str, num_variables=[], cat_variables=[], formula=None, verbose=False)

Runs an ANOVA.

ANOVAs are used when one wants to compare the means of a condition between 2+ groups.

ANOVA tests if there is a difference in the mean somewhere in the model (testing if there was an overall effect), but it does not tell you where the difference is, if there is one.

Parameters:
  • dep_var (str) – Dependent variable you want to explore the relationship of
  • num_variables (list, optional) – Numeric variable columns, by default []
  • cat_variables (list, optional) – Categorical variable columns, by default []
  • formula (str, optional) – OLS formula statsmodel lib, by default None
  • verbose (bool, optional) – True to print OLS model summary and formula, by default False

Examples

>>> data.anova('dep_col', num_variables=['col1', 'col2'], verbose=True)
>>> data.anova('dep_col', cat_variables=['col1', 'col2'], verbose=True)
>>> data.anova('dep_col', num_variables=['col1', 'col2'], cat_variables=['col3'], verbose=True)
ind_ttest(group1: str, group2: str, equal_var=True, output_file=None)

Performs an Independent T test.

This is to be used when you want to compare the means of 2 groups.

If group 2 column name is not provided and there is a test set, it will compare the same column in the train and test set.

If there are any NaN’s they will be omitted.

Parameters:
  • group1 (str) – Column for group 1 to compare.
  • group2 (str, optional) – Column for group 2 to compare, by default None
  • equal_var (bool, optional) – If True (default), perform a standard independent 2 sample test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variance, by default True
  • output_file (str, optional) – Name of the file to output, by default None
Returns:

T test statistic, P value

Return type:

list

Examples

>>> data.ind_ttest('col1', 'col2')
>>> data.ind_ttest('col1', 'col2', output_file='ind_ttest.png')
ks_feature_distribution(threshold=0.1, show_plots=True)

Uses the Kolmogorov-Smirnov test to see if the distributions in the training and test sets are similar.

Credit: https://www.kaggle.com/nanomathias/distribution-of-test-vs-training-data#1.-t-SNE-Distribution-Overview

Parameters:
  • threshold (float, optional) – KS statistic threshold, by default 0.1
  • show_plots (bool, optional) – True to show histograms of feature distributions, by default True
Returns:

Columns that are significantly different in the train and test set.

Return type:

DataFrame

Examples

>>> data.ks_feature_distribution()
>>> data.ks_feature_distribution(threshold=0.2)
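
A standalone sketch of the underlying check (an assumed approach, for illustration): run SciPy's two-sample KS test per numeric column and flag columns whose statistic exceeds the threshold.

import pandas as pd
from scipy.stats import ks_2samp

def differing_columns(train: pd.DataFrame, test: pd.DataFrame, threshold: float = 0.1):
    flagged = []
    for col in train.select_dtypes("number").columns:
        statistic, p_value = ks_2samp(train[col].dropna(), test[col].dropna())
        if statistic > threshold and p_value < 0.05:
            flagged.append(col)
    return flagged
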
most_common(col: str, n=15, plot=False, use_test=False, output_file='', **plot_kwargs)

Analyzes the most common values in the column and either prints them or displays a bar chart.

Parameters:
  • col (str) – Column to analyze
  • n (int, optional) – Number of top most common values to display, by default 15
  • plot (bool, optional) – True to plot a bar chart, by default False
  • use_test (bool, optional) – True to analyze the test set, by default False
  • output_file (str, optional) – File name to save the plot as, if plot=True

Examples

>>> data.most_common('col1', plot=True)
>>> data.most_common('col1', n=50, plot=True)
>>> data.most_common('col1', n=50)
onesample_ttest(group1: str, mean: Union[float, int], output_file=None)

Performs a One Sample t-test.

This is to be used when you want to compare the mean of a single group against a known mean.

If there are any NaN’s they will be omitted.

Parameters:
  • group1 (str) – Column for group 1 to compare.
  • mean (float, int, optional) – Sample mean to compare to.
  • output_file (str, optional) – Name of the file to output, by default None
Returns:

T test statistic, P value

Return type:

list

Examples

>>> data.onesample_ttest('col1', 1)
>>> data.onesample_ttest('col1', 1, output_file='ones_ttest.png')
paired_ttest(group1: str, group2=None, output_file=None)

Performs a Paired t-test.

This is to be used when you want to compare the means from the same group at different times.

If group 2 column name is not provided and there is a test set, it will compare the same column in the train and test set.

If there are any NaN’s they will be omitted.

Parameters:
  • group1 (str) – Column for group 1 to compare.
  • group2 (str, optional) – Column for group 2 to compare, by default None
  • output_file (str, optional) – Name of the file to output, by default None
Returns:

T test statistic, P value

Return type:

list

Examples

>>> data.paired_ttest('col1', 'col2')
>>> data.paired_ttest('col1', 'col2', output_file='pair_ttest.png')
predict_data_sample()

Identifies how similar the train and test set distributions are by trying to predict whether each sample belongs to the train or test set, using a Random Forest and 10-fold stratified cross validation.

The lower the F1 score, the more similar the distributions are as it’s harder to predict which sample belongs to which distribution.

Credit: https://www.kaggle.com/nanomathias/distribution-of-test-vs-training-data#1.-t-SNE-Distribution-Overview

Returns:Returns a deep copy of the Data object.
Return type:Data

Examples

>>> data.predict_data_sample()
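
A standalone sketch of this "adversarial validation" idea (assumes purely numeric features; the function and variable names are illustrative):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def train_test_similarity(train: pd.DataFrame, test: pd.DataFrame) -> float:
    # Label each row by the set it came from, then try to predict that label.
    x = pd.concat([train, test], ignore_index=True)
    y = np.r_[np.zeros(len(train), dtype=int), np.ones(len(test), dtype=int)]

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(RandomForestClassifier(n_estimators=100), x, y, cv=cv, scoring="f1")
    return scores.mean()   # lower -> the two sets are harder to tell apart
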

Modelling Module

class aethos.modelling.model.ModelBase(x_train, target, x_test=None, test_split_percentage=0.2, exp_name='my-experiment')

Bases: object

Doc2Vec(col_name, prep=False, model_name='d2v', run=True, **kwargs)

The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model. For instance: “dog”, “puppy” and “pup” are often used in similar situations, with similar surrounding words like “good”, “fluffy” or “cute”, and according to Word2Vec they will therefore share a similar vector representation.

From this assumption, Word2Vec can be used to find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for other applications such as text classification or clustering.

For more information on word2vec, you can view it here https://radimrehurek.com/gensim/models/word2vec.html.

Parameters:
  • col_name (str, optional) – Column name of text data that you want to summarize
  • prep (bool, optional) – True to prep the data. Use when passing in raw text data. False if passing in text that is already prepped. By default False
  • model_name (str, optional) – Name for this model, by default 'd2v'
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • dm ({1,0}, optional) – Defines the training algorithm. If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.
  • vector_size (int, optional) – Dimensionality of the feature vectors.
  • window (int, optional) – The maximum distance between the current and predicted word within a sentence.
  • alpha (float, optional) – The initial learning rate.
  • min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.
  • min_count (int, optional) – Ignores all words with total frequency lower than this.
  • max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
  • sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
  • workers (int, optional) – Use these many worker threads to train the model (=faster training with multicore machines).
  • epochs (int, optional) – Number of iterations (epochs) over the corpus.
  • hs ({1,0}, optional) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
  • negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
  • ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.
  • dm_mean ({1,0}, optional) – If 0 , use the sum of the context word vectors. If 1, use the mean. Only applies when dm is used in non-concatenative mode.
  • dm_concat ({1,0}, optional) – If 1, use concatenation of context vectors rather than sum/average; Note concatenation results in a much-larger model, as the input is no longer the size of one (sampled or arithmetically combined) word vector, but the size of the tag(s) and all words in the context strung together.
  • dm_tag_count (int, optional) – Expected constant number of document tags per document, when using dm_concat mode.
  • dbow_words ({1,0}, optional) – If set to 1 trains word-vectors (in skip-gram fashion) simultaneous with DBOW doc-vector training; If 0, only trains doc-vectors (faster).
  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during current method call and is not stored as part of the model.

    The input parameters are of the following types:

    word (str) - the word we are examining

    count (int) - the word’s frequency count in the corpus

    min_count (int) - the minimum count threshold.

Returns:

Resulting model

Return type:

TextModelAnalysis

Examples

>>> model.Doc2Vec('col1', prep=True)
>>> model.Doc2Vec('col1', run=False) # Add model to the queue
LDA(col_name, prep=False, model_name='lda', run=True, **kwargs)

Extracts topics from your data using Latent Dirichlet Allocation.

For more information on LDA, you can view it here https://radimrehurek.com/gensim/models/ldamodel.html.

Parameters:
  • col_name (str, optional) – Column name of text data that you want to summarize
  • prep (bool, optional) – True to prep the data. Use when passing in raw text data. False if passing in text that is already prepped. By default False
  • model_name (str, optional) – Name for this model, default to lda
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • num_topics ((int, optional)) – The number of requested latent topics to be extracted from the training corpus.
  • distributed ((bool, optional)) – Whether distributed computing should be used to accelerate training.
  • chunksize ((int, optional)) – Number of documents to be used in each training chunk.
  • passes ((int, optional)) – Number of passes through the corpus during training.
  • update_every ((int, optional)) – Number of documents to be iterated through for each update. Set to 0 for batch learning, > 1 for online iterative learning.
  • alpha (({numpy.ndarray, str}, optional)) –

    Can be set to an 1D array of length equal to the number of expected topics that expresses our a-priori belief for the each topics’ probability. Alternatively default prior selecting strategies can be employed by supplying a string:

    ’asymmetric’: Uses a fixed normalized asymmetric prior of 1.0 / topicno.

    ’auto’: Learns an asymmetric prior from the corpus (not available if distributed==True).

  • eta (({float, np.array, str}, optional)) –

    A-priori belief on word probability, this can be:

    scalar for a symmetric prior over topic/word probability,

    vector of length num_words to denote an asymmetric user defined probability for each word,

    matrix of shape (num_topics, num_words) to assign a probability for each word-topic combination,

    the string ‘auto’ to learn the asymmetric prior from the data.

  • decay ((float, optional)) – A number between (0.5, 1] to weight what percentage of the previous lambda value is forgotten when each new document is examined. Corresponds to Kappa from Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation NIPS’10”.
  • offset ((float, optional)) – Hyper-parameter that controls how much we will slow down the first steps the first few iterations. Corresponds to Tau_0 from Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation NIPS’10”.
  • eval_every ((int, optional)) – Log perplexity is estimated every that many updates. Setting this to one slows down training by ~2x.
  • iterations ((int, optional)) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.
  • gamma_threshold ((float, optional)) – Minimum change in the value of the gamma parameters to continue iterating.
  • minimum_probability ((float, optional)) – Topics with a probability lower than this threshold will be filtered out.
  • random_state (({np.random.RandomState, int}, optional)) – Either a randomState object or a seed to generate one. Useful for reproducibility.
  • ns_conf ((dict of (str, object), optional)) – Key word parameters propagated to gensim.utils.getNS() to get a Pyro4 Nameserver. Only used if distributed is set to True.
  • minimum_phi_value ((float, optional)) – If per_word_topics is True, this represents a lower bound on the term probabilities.
  • per_word_topics ((bool, optional)) – If True, the model also computes a list of topics, sorted in descending order of most likely topics for each word, along with their phi values multiplied by the feature length (i.e. word count).
Returns:

Resulting model

Return type:

TextModelAnalysis

Examples

>>> model.LDA('col1', prep=True)
>>> model.LDA('col1', run=False) # Add model to the queue
Word2Vec(col_name, prep=False, model_name='w2v', run=True, **kwargs)

The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model. For instance: “dog”, “puppy” and “pup” are often used in similar situations, with similar surrounding words like “good”, “fluffy” or “cute”, and according to Word2Vec they will therefore share a similar vector representation.

From this assumption, Word2Vec can be used to find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for other applications such as text classification or clustering.

For more information on word2vec, you can view it here https://radimrehurek.com/gensim/models/word2vec.html.

Parameters:
  • col_name (str, optional) – Column name of text data that you want to summarize
  • prep (bool, optional) – True to prep the data. Use when passing in raw text data. False if passing in text that is already prepped. By default False
  • model_name (str, optional) – Name for this model, by default 'w2v'
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • size (int, optional) – Dimensionality of the word vectors.
  • window (int, optional) – Maximum distance between the current and predicted word within a sentence.
  • min_count (int, optional) – Ignores all words with total frequency lower than this.
  • sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.
  • hs ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
  • negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
  • ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.
  • cbow_mean ({0, 1}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean, only applies when cbow is used.
  • alpha (float, optional) – The initial learning rate.
  • min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.
  • max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
  • max_final_vocab (int, optional) – Limits the vocab to a target vocab size by automatically picking a matching min_count. If the specified min_count is more than the calculated min_count, the specified min_count will be used. Set to None if not required.
  • sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
  • hashfxn (function, optional) – Hash function to use to randomly initialize weights, for increased training reproducibility.
  • iter (int, optional) – Number of iterations (epochs) over the corpus.
  • trim_rule (function, optional) –

    Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.

    The input parameters are of the following types:

    word (str) - the word we are examining

    count (int) - the word’s frequency count in the corpus

    min_count (int) - the minimum count threshold.

  • sorted_vocab ({0, 1}, optional) – If 1, sort the vocabulary by descending frequency before assigning word indexes. See sort_vocab().
  • batch_words (int, optional) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines). (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
  • compute_loss (bool, optional) – If True, computes and stores loss value which can be retrieved using get_latest_training_loss().
Returns:

Resulting model

Return type:

TextModelAnalysis

Examples

>>> model.Word2Vec('col1', prep=True)
>>> model.Word2Vec('col1', run=False) # Add model to the queue
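The context-similarity idea above can be illustrated directly with gensim, the library this wrapper builds on. The following is a standalone sketch, not part of the aethos API; the toy corpus is made up and far too small to produce meaningful vectors, it only shows the mechanics (gensim 3.x parameter names, matching the size/window/sg parameters documented above):

>>> from gensim.models import Word2Vec
>>> sentences = [['good', 'fluffy', 'dog'], ['cute', 'fluffy', 'puppy'], ['good', 'cute', 'pup']]
>>> w2v = Word2Vec(sentences, size=10, window=2, min_count=1, sg=1)  # train tiny skip-gram vectors
>>> w2v.wv.similarity('dog', 'puppy')  # cosine similarity between the two word vectors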
columns

Property to return columns in the dataset.

compare_models()

Compare different models across every known metric for that model.

Returns:Dataframe of every model and metrics associated for that model
Return type:Dataframe

Examples

>>> model.compare_models()
copy()

Returns deep copy of object.

Returns:Deep copy of object
Return type:Object
delete_model(name)

Deletes a model, specified by its name, which can be viewed by calling list_models.

Looks in both queued and run models and deletes the model wherever it is found.

Parameters:name (str) – Name of the model

Examples

>>> model.delete_model('model1')
extract_keywords_gensim(*list_args, list_of_cols=[], new_col_name='_extracted_keywords', model_name='model_extracted_keywords_gensim', run=True, **keyword_kwargs)

Extracts keywords using Gensim’s implementation of the Text Rank algorithm.

Get most ranked words of provided text and/or its combinations.

Parameters:
  • list_of_cols (list, optional) – Column name(s) of text data that you want to extract keywords from
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default _extracted_keywords
  • model_name (str, optional) – Name for this model, by default ‘model_extracted_keywords_gensim’
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • ratio (float, optional) – Number between 0 and 1 that determines the proportion of the number of sentences of the original text to be chosen for the summary.
  • words (int, optional) – Number of returned words.
  • split (bool, optional) – If True, list of sentences will be returned. Otherwise joined strings will be returned.
  • scores (bool, optional) – Whether to return the score along with each keyword.
  • pos_filter (tuple, optional) – Part of speech filters.
  • lemmatize (bool, optional) – If True - lemmatize words.
  • deacc (bool, optional) – If True - remove accentuation.
Returns:

Resulting model

Return type:

TextModelAnalysis

Examples

>>> model.extract_keywords_gensim('col1')
>>> model.extract_keywords_gensim('col1', run=False) # Add model to the queue
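Under the hood this wraps gensim's TextRank keyword extractor. A standalone gensim (3.x) sketch of the same call, with a made-up input string, looks roughly like this:

>>> from gensim.summarization import keywords
>>> text = 'Compatibility of systems of linear constraints over the set of natural numbers ...'
>>> keywords(text, words=3, scores=True, lemmatize=True)  # top 3 lemmatized keywords with their scores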
features

Features for modelling

help_debug()

Displays tips to help debug model outputs and deal with over- and underfitting.

Credit: Andrew Ng and his book Machine Learning Yearning

Examples

>>> model.help_debug()
list_models()

Prints out all queued and run models.

Examples

>>> model.list_models()
pretrained_question_answer(context_col: str, question_col: str, model_type=None, new_col_name='qa', run=True)

Uses Huggingface’s pipeline to automatically run Q&A analysis on text.

The default model is ‘tf_distil_bert_for_question_answering_2’

Possible model types are:

  • bert-base-uncased
  • bert-large-uncased
  • bert-base-cased
  • bert-large-cased
  • bert-base-multilingual-uncased
  • bert-base-multilingual-cased
  • bert-base-chinese
  • bert-base-german-cased
  • bert-large-uncased-whole-word-masking
  • bert-large-cased-whole-word-masking
  • bert-large-uncased-whole-word-masking-finetuned-squad
  • bert-large-cased-whole-word-masking-finetuned-squad
  • bert-base-cased-finetuned-mrpc
  • bert-base-german-dbmdz-cased
  • bert-base-german-dbmdz-uncased
  • bert-base-japanese
  • bert-base-japanese-whole-word-masking
  • bert-base-japanese-char
  • bert-base-japanese-char-whole-word-masking
  • bert-base-finnish-cased-v1
  • bert-base-finnish-uncased-v1
  • openai-gpt
  • gpt2
  • gpt2-medium
  • gpt2-large
  • gpt2-xl
  • transfo-xl-wt103
  • xlnet-base-cased
  • xlnet-large-cased
  • xlm-mlm-en-2048
  • xlm-mlm-ende-1024
  • xlm-mlm-enfr-1024
  • xlm-mlm-enro-1024
  • xlm-mlm-xnli15-1024
  • xlm-mlm-tlm-xnli15-1024
  • xlm-clm-enfr-1024
  • xlm-clm-ende-1024
  • xlm-mlm-17-1280
  • xlm-mlm-100-1280
  • roberta-base
  • roberta-large
  • roberta-large-mnli
  • distilroberta-base
  • roberta-base-openai-detector
  • roberta-large-openai-detector
  • distilbert-base-uncased
  • distilbert-base-uncased-distilled-squad
  • distilgpt2
  • distilbert-base-german-cased
  • distilbert-base-multilingual-cased
  • ctrl
  • camembert-base
  • albert-base-v1
  • albert-large-v1
  • albert-xlarge-v1
  • albert-xxlarge-v1
  • albert-base-v2
  • albert-large-v2
  • albert-xlarge-v2
  • albert-xxlarge-v2
  • t5-small
  • t5-base
  • t5-large
  • t5-3B
  • t5-11B
  • xlm-roberta-base
  • xlm-roberta-large
Parameters:
  • context_col (str) – Column name that contains the context for the question
  • question_col (str) – Column name of the question
  • model_type (str, optional) – Type of model, by default None
  • new_col_name (str, optional) – New column name for the answers, by default “qa”
Returns:

Return type:

TF or PyTorch model

Examples

>>> m.pretrained_question_answer('col1', 'col2')
>>> m.pretrained_question_answer('col1', 'col2', model_type='albert-base-v1')
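Internally this relies on the Huggingface transformers question-answering pipeline. A rough standalone sketch of that underlying call (not the aethos API; the question and context strings are made up):

>>> from transformers import pipeline
>>> qa = pipeline('question-answering')  # loads a default DistilBERT model fine-tuned on SQuAD
>>> qa(question='Who wrote the report?', context='The report was written by the data team in 2019.')
>>> # returns a dict with 'answer', 'score', 'start' and 'end' keys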
pretrained_sentiment_analysis(col: str, model_type=None, new_col_name='sent_score', run=True)

Uses Huggingface’s pipeline to automatically run sentiment analysis on text.

The default model is ‘tf_distil_bert_for_sequence_classification_2’

Possible model types are:

  • bert-base-uncased
  • bert-large-uncased
  • bert-base-cased
  • bert-large-cased
  • bert-base-multilingual-uncased
  • bert-base-multilingual-cased
  • bert-base-chinese
  • bert-base-german-cased
  • bert-large-uncased-whole-word-masking
  • bert-large-cased-whole-word-masking
  • bert-large-uncased-whole-word-masking-finetuned-squad
  • bert-large-cased-whole-word-masking-finetuned-squad
  • bert-base-cased-finetuned-mrpc
  • bert-base-german-dbmdz-cased
  • bert-base-german-dbmdz-uncased
  • bert-base-japanese
  • bert-base-japanese-whole-word-masking
  • bert-base-japanese-char
  • bert-base-japanese-char-whole-word-masking
  • bert-base-finnish-cased-v1
  • bert-base-finnish-uncased-v1
  • openai-gpt
  • gpt2
  • gpt2-medium
  • gpt2-large
  • gpt2-xl
  • transfo-xl-wt103
  • xlnet-base-cased
  • xlnet-large-cased
  • xlm-mlm-en-2048
  • xlm-mlm-ende-1024
  • xlm-mlm-enfr-1024
  • xlm-mlm-enro-1024
  • xlm-mlm-xnli15-1024
  • xlm-mlm-tlm-xnli15-1024
  • xlm-clm-enfr-1024
  • xlm-clm-ende-1024
  • xlm-mlm-17-1280
  • xlm-mlm-100-1280
  • roberta-base
  • roberta-large
  • roberta-large-mnli
  • distilroberta-base
  • roberta-base-openai-detector
  • roberta-large-openai-detector
  • distilbert-base-uncased
  • distilbert-base-uncased-distilled-squad
  • distilgpt2
  • distilbert-base-german-cased
  • distilbert-base-multilingual-cased
  • ctrl
  • camembert-base
  • albert-base-v1
  • albert-large-v1
  • albert-xlarge-v1
  • albert-xxlarge-v1
  • albert-base-v2
  • albert-large-v2
  • albert-xlarge-v2
  • albert-xxlarge-v2
  • t5-small
  • t5-base
  • t5-large
  • t5-3B
  • t5-11B
  • xlm-roberta-base
  • xlm-roberta-large
Parameters:
  • col (str) – Column of text to run sentiment analysis on
  • model_type (str, optional) – Type of model, by default None
  • new_col_name (str, optional) – New column name for the sentiment scores, by default “sent_score”
Returns:

Return type:

TF or PyTorch model

Examples

>>> m.pretrained_sentiment_analysis('col1')
>>> m.pretrained_sentiment_analysis('col1', model_type='albert-base-v1')
run_models(method='parallel')

Runs all queued models.

The models can either be run one after the other (‘series’) or at the same time in parallel.

Parameters:method (str, optional) – How to run models, can either be in ‘series’ or in ‘parallel’, by default ‘parallel’

Examples

>>> model.run_models()
>>> model.run_models(method='series')
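A minimal queued-workflow sketch combining run=False with list_models and run_models (only methods documented above are used; 'col1' is a placeholder column name):

>>> model.extract_keywords_gensim('col1', run=False)  # queue a model
>>> model.summarize_gensim('col1', run=False)         # queue another
>>> model.list_models()                               # inspect the queue
>>> model.run_models(method='series')                 # train them one after the other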
summarize_gensim(*list_args, list_of_cols=[], new_col_name='_summarized', model_name='model_summarize_gensim', run=True, **summarizer_kwargs)

Summarize bodies of text using Gensim’s Text Rank algorithm. Note that it uses a Text Rank variant as stated here: https://radimrehurek.com/gensim/summarization/summariser.html

The output summary will consist of the most representative sentences and will be returned as a string, divided by newlines.

Parameters:
  • list_of_cols (list, optional) – Column name(s) of text data that you want to summarize
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default _summarized
  • model_name (str, optional) – Name for this model, by default ‘model_summarize_gensim’
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • ratio (float, optional) – Number between 0 and 1 that determines the proportion of the number of sentences of the original text to be chosen for the summary.
  • word_count (int or None, optional) – Determines how many words the output will contain. If both parameters are provided, the ratio will be ignored.
  • split (bool, optional) – If True, list of sentences will be returned. Otherwise joined strings will be returned.
Returns:

Resulting model

Return type:

TextModelAnalysis

Examples

>>> model.summarize_gensim('col1')
>>> model.summarize_gensim('col1', run=False) # Add model to the queue
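For reference, the equivalent standalone gensim (3.x) call that this wrapper builds on, with a placeholder article_text variable:

>>> from gensim.summarization import summarize
>>> summarize(article_text, ratio=0.2)      # keep roughly 20% of the sentences
>>> summarize(article_text, word_count=60)  # or cap the summary at about 60 words (ratio is then ignored)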
test_data

Testing data used to evaluate models

to_pickle(name: str)

Writes model to a pickle file.

Parameters:name (str) – Name of the model

Examples

>>> m = Model(df)
>>> m.LogisticRegression()
>>> m.to_pickle('log_reg')
to_service(model_name: str, project_name: str)

Creates an app.py, requirements.txt and Dockerfile in ~/.aethos/projects and the necessary folder structure to run the model as a microservice.

Parameters:
  • model_name (str) – Name of the model to create a microservice of.
  • project_name (str) – Name of the project that you want to create.

Examples

>>> m = Model(df)
>>> m.LogisticRegression()
>>> m.to_service('log_reg', 'your_proj_name')
train_data

Training data used for modelling

y_test

Property function for the testing predictor variable

class aethos.modelling.classification_models.Classification(x_train, target, x_test=None, test_split_percentage=0.2, exp_name='my-experiment')

Bases: aethos.modelling.model.ModelBase, aethos.analysis.Analysis, aethos.cleaning.clean.Clean, aethos.preprocessing.preprocess.Preprocess, aethos.feature_engineering.feature.Feature, aethos.visualizations.visualizations.Visualizations, aethos.stats.stats.Stats

ADABoostClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='ada_cls', run=True, verbose=1, **kwargs)

Trains an AdaBoost classification model.

An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

For more AdaBoost info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “ada_cls”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • base_estimator (object, optional (default=None)) – The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, then the base estimator is DecisionTreeClassifier(max_depth=1)
  • n_estimators (integer, optional (default=50)) – The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.
  • learning_rate (float, optional (default=1.)) – Learning rate shrinks the contribution of each classifier by learning_rate. There is a trade-off between learning_rate and n_estimators.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.AdaBoostClassification()
>>> model.AdaBoostClassification(model_name='rc_1', learning_rate=0.001)
>>> model.AdaBoostClassification(cv_type='kfold')
>>> model.AdaBoostClassification(gridsearch={'n_estimators': [50, 100]}, cv_type='strat-kfold')
>>> model.AdaBoostClassification(run=False) # Add model to the queue
BaggingClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='bag_cls', run=True, verbose=1, **kwargs)

Trains a Bagging classification model.

A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

For more Bagging Classifier info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “bag_cls”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • base_estimator (object or None, optional (default=None)) – The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.
  • n_estimators (int, optional (default=10)) – The number of base estimators in the ensemble.
  • max_samples (int or float, optional (default=1.0)) –

    The number of samples to draw from X to train each base estimator.

    If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples.
  • max_features (int or float, optional (default=1.0)) –

    The number of features to draw from X to train each base estimator.

    If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.
  • bootstrap (boolean, optional (default=True)) – Whether samples are drawn with replacement. If False, sampling without replacement is performed.
  • bootstrap_features (boolean, optional (default=False)) – Whether features are drawn with replacement.
  • oob_score (bool, optional (default=False)) – Whether to use out-of-bag samples to estimate the generalization error.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.BaggingClassification()
>>> model.BaggingClassification(model_name='m1', n_estimators=100)
>>> model.BaggingClassification(cv_type='kfold')
>>> model.BaggingClassification(gridsearch={'n_estimators':[100, 200]}, cv_type='strat-kfold')
>>> model.BaggingClassification(run=False) # Add model to the queue
BernoulliClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='bern', run=True, verbose=1, **kwargs)

Trains a Bernoulli Naive Bayes classification model.

Like MultinomialNB, this classifier is suitable for discrete data. The difference is that while MultinomialNB works with occurrence counts, BernoulliNB is designed for binary/boolean features.

For more Bernoulli Naive Bayes info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB and https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “bern”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • alpha (float, optional (default=1.0)) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
  • binarize (float or None, optional (default=0.0)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
  • fit_prior (boolean, optional (default=True)) – Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
  • class_prior (array-like, size=[n_classes,], optional (default=None)) – Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.BernoulliClassification()
>>> model.BernoulliClassification(model_name='m1', binarize=0.5)
>>> model.BernoulliClassification(cv_type='kfold')
>>> model.BernoulliClassification(gridsearch={'fit_prior':[True, False]}, cv_type='strat-kfold')
>>> model.BernoulliClassification(run=False) # Add model to the queue
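To make the binarize parameter above concrete, here is a small numpy sketch of the thresholding BernoulliNB applies to the features (the values are made up):

>>> import numpy as np
>>> X = np.array([[0.2, 0.0], [0.7, 1.5]])
>>> (X > 0.5).astype(int)  # what binarize=0.5 does to the inputs before fitting
array([[0, 0],
       [1, 1]])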
DecisionTreeClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='dt_cls', run=True, verbose=1, **kwargs)

Trains a Decision Tree classification model.

For more Decision Tree info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “dt_cls”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • criterion (string, optional (default=”gini”)) – The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
  • splitter (string, optional (default=”best”)) – The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
  • max_depth (int or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  • min_samples_split (int, float, optional (default=2)) –

    The minimum number of samples required to split an internal node:

    If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
  • min_samples_leaf (int, float, optional (default=1)) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
  • max_features (int, float, string or None, optional (default=None)) –

    The number of features to consider when looking for the best split:

    If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

  • max_leaf_nodes (int or None, optional (default=None)) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
  • min_impurity_decrease (float, optional (default=0.)) –

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

    The weighted impurity decrease equation is the following:

    N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

    N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed. (A small numeric check of this formula is shown after this method's Examples.)

  • min_impurity_split (float, (default=1e-7)) – Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
  • class_weight (dict, list of dicts, “balanced” or None, default=None) –

    Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

    Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

    The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

    For multi-output, the weights of each column of y will be multiplied.

    Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

  • presort (bool, optional (default=False)) – Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to true may slow down the training process. When using either a smaller dataset or a restricted depth, this may speed up the training.
  • ccp_alpha (non-negative float, optional (default=0.0)) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.DecisionTreeClassification()
>>> model.DecisionTreeClassification(model_name='m1', min_impurity_split=0.0003)
>>> model.DecisionTreeClassification(cv_type='kfold')
>>> model.DecisionTreeClassification(gridsearch={'min_impurity_split':[0.01, 0.02]}, cv_type='strat-kfold')
>>> model.DecisionTreeClassification(run=False) # Add model to the queue
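A rough numeric check of the weighted impurity decrease formula documented under min_impurity_decrease above (all counts and impurity values are made up):

>>> N, N_t, N_t_L, N_t_R = 100, 40, 25, 15
>>> impurity, left_impurity, right_impurity = 0.48, 0.30, 0.20
>>> N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)  # ~0.087
>>> # the node is split only if min_impurity_decrease <= this value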
GaussianClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='gauss', run=True, verbose=1, **kwargs)

Trains a Gaussian Naive Bayes classification model.

For more Gaussian Naive Bayes info, you can view it here: https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “gauss”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • priors (array-like, shape (n_classes,)) – Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
  • var_smoothing (float, optional (default=1e-9)) – Portion of the largest variance of all features that is added to variances for calculation stability.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.GaussianClassification()
>>> model.GaussianClassification(model_name='m1', var_smoothing=0.0003)
>>> model.GaussianClassification(cv_type='kfold')
>>> model.GaussianClassification(gridsearch={'var_smoothing':[0.01, 0.02]}, cv_type='strat-kfold')
>>> model.GaussianClassification(run=False) # Add model to the queue
GradientBoostingClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='grad_cls', run=True, verbose=1, **kwargs)

Trains a Gradient Boosting classification model.

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.

For more Gradient Boosting Classifier info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “grad_cls”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • loss ({‘deviance’, ‘exponential’}, optional (default=’deviance’)) – loss function to be optimized. ‘deviance’ refers to deviance (= logistic regression) for classification with probabilistic outputs. For loss ‘exponential’ gradient boosting recovers the AdaBoost algorithm.
  • learning_rate (float, optional (default=0.1)) – learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.
  • n_estimators (int (default=100)) – The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.
  • subsample (float, optional (default=1.0)) – The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. Subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
  • criterion (string, optional (default=”friedman_mse”)) – The function to measure the quality of a split. Supported criteria are “friedman_mse” for the mean squared error with improvement score by Friedman, “mse” for mean squared error, and “mae” for the mean absolute error. The default value of “friedman_mse” is generally the best as it can provide a better approximation in some cases.
  • min_samples_split (int, float, optional (default=2)) –

    The minimum number of samples required to split an internal node:

    If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
  • min_samples_leaf (int, float, optional (default=1)) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
  • max_depth (integer, optional (default=3)) – maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
  • max_features (int, float, string or None, optional (default=None)) –

    The number of features to consider when looking for the best split:

    If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.

    Choosing max_features < n_features leads to a reduction of variance and an increase in bias.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

  • max_leaf_nodes (int or None, optional (default=None)) – Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
  • presort (bool or ‘auto’, optional (default=’auto’)) – Whether to presort the data to speed up the finding of best splits in fitting. Auto mode by default will use presorting on dense data and default to normal sorting on sparse data. Setting presort to true on sparse data will raise an error.
  • validation_fraction (float, optional, default 0.1) – The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if n_iter_no_change is set to an integer.
  • tol (float, optional, default 1e-4) – Tolerance for the early stopping. When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.GradientBoostingClassification()
>>> model.GradientBoostingClassification(model_name='m1', n_estimators=100)
>>> model.GradientBoostingClassification(cv_type='kfold')
>>> model.GradientBoostingClassification(gridsearch={'n_estimators':[100, 200]}, cv_type='strat-kfold')
>>> model.GradientBoostingClassification(run=False) # Add model to the queue
LightGBMClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='lgbm_cls', run=True, verbose=1, **kwargs)

Trains a LightGBM classification model.

LightGBM is a gradient boosting framework that uses a tree-based learning algorithm.

LightGBM grows trees vertically while other algorithms grow trees horizontally, meaning that LightGBM grows trees leaf-wise while other algorithms grow level-wise. It will choose the leaf with the maximum delta loss to grow. When growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.

For more LightGBM info, you can view it here: https://github.com/microsoft/LightGBM and https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “lgbm_cls”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • boosting_type (string, optional (default='gbdt')) – ‘gbdt’, traditional Gradient Boosting Decision Tree. ‘dart’, Dropouts meet Multiple Additive Regression Trees. ‘goss’, Gradient-based One-Side Sampling. ‘rf’, Random Forest.
  • num_leaves (int, optional (default=31)) – Maximum tree leaves for base learners.
  • max_depth (int, optional (default=-1)) – Maximum tree depth for base learners, <=0 means no limit.
  • learning_rate (float, optional (default=0.1)) – Boosting learning rate. You can use callbacks parameter of fit method to shrink/adapt learning rate in training using reset_parameter callback. Note, that this will ignore the learning_rate argument in training.
  • n_estimators (int, optional (default=100)) – Number of boosted trees to fit.
  • subsample_for_bin (int, optional (default=200000)) – Number of samples for constructing bins.
  • objective (string, callable or None, optional (default=None)) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). Default: ‘regression’ for LGBMRegressor, ‘binary’ or ‘multiclass’ for LGBMClassifier, ‘lambdarank’ for LGBMRanker.
  • class_weight (dict, 'balanced' or None, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters. Note, that the usage of all these parameters will result in poor estimates of the individual class probabilities. You may want to consider performing probability calibration (https://scikit-learn.org/stable/modules/calibration.html) of your model. The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). If None, all classes are supposed to have weight one. Note, that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
  • min_split_gain (float, optional (default=0.)) – Minimum loss reduction required to make a further partition on a leaf node of the tree.
  • min_child_weight (float, optional (default=1e-3)) – Minimum sum of instance weight (hessian) needed in a child (leaf).
  • min_child_samples (int, optional (default=20)) – Minimum number of data needed in a child (leaf).
  • subsample (float, optional (default=1.)) – Subsample ratio of the training instance.
  • subsample_freq (int, optional (default=0)) – Frequency of subsample, <=0 means no enable.
  • colsample_bytree (float, optional (default=1.)) – Subsample ratio of columns when constructing each tree.
  • reg_alpha (float, optional (default=0.)) – L1 regularization term on weights.
  • reg_lambda (float, optional (default=0.)) – L2 regularization term on weights.
  • random_state (int or None, optional (default=None)) – Random number seed. If None, default seeds in C++ code will be used.
  • n_jobs (int, optional (default=-1)) – Number of parallel threads.
  • silent (bool, optional (default=True)) – Whether to print messages while running boosting.
  • importance_type (string, optional (default='split')) – The type of feature importance to be filled into feature_importances_. If ‘split’, result contains numbers of times the feature is used in a model. If ‘gain’, result contains total gains of splits which use the feature.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.LightGBMClassification()
>>> model.LightGBMClassification(model_name='m1', reg_alpha=0.0003)
>>> model.LightGBMClassification(cv_type='kfold')
>>> model.LightGBMClassification(gridsearch={'reg_alpha':[0.01, 0.02]}, cv_type='strat-kfold')
>>> model.LightGBMClassification(run=False) # Add model to the queue
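For comparison, the plain LGBMClassifier that this method wraps can also be used directly; a minimal sketch assuming lightgbm is installed and x_train, y_train and x_test are placeholder arrays:

>>> from lightgbm import LGBMClassifier
>>> lgbm = LGBMClassifier(boosting_type='gbdt', num_leaves=31, learning_rate=0.1, n_estimators=100)
>>> lgbm.fit(x_train, y_train)   # grow the boosted trees leaf-wise
>>> lgbm.predict(x_test)         # predicted class labels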
LinearSVC(cv_type=None, gridsearch=None, score='accuracy', model_name='linsvc', run=True, verbose=1, **kwargs)

Trains a Linear Support Vector classification model.

Supports multi-class classification.

Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples. This class supports both dense and sparse input and the multiclass support is handled according to a one-vs-the-rest scheme.

For more Support Vector info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “linsvc”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • penalty (string, ‘l1’ or ‘l2’ (default=’l2’)) – Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ leads to coef_ vectors that are sparse.
  • loss (string, ‘hinge’ or ‘squared_hinge’ (default=’squared_hinge’)) – Specifies the loss function. ‘hinge’ is the standard SVM loss (used e.g. by the SVC class) while ‘squared_hinge’ is the square of the hinge loss.
  • dual (bool, (default=True)) – Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.
  • tol (float, optional (default=1e-4)) – Tolerance for stopping criteria.
  • C (float, optional (default=1.0)) – Penalty parameter C of the error term.
  • multi_class (string, ‘ovr’ or ‘crammer_singer’ (default=’ovr’)) – Determines the multi-class strategy if y contains more than two classes. “ovr” trains n_classes one-vs-rest classifiers, while “crammer_singer” optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If “crammer_singer” is chosen, the options loss, penalty and dual will be ignored.
  • fit_intercept (boolean, optional (default=True)) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).
  • intercept_scaling (float, optional (default=1)) – When self.fit_intercept is True, instance vector x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equals to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.
  • class_weight ({dict, ‘balanced’}, optional) – Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
  • max_iter (int, (default=1000)) – The maximum number of iterations to be run.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.LinearSVC()
>>> model.LinearSVC(model_name='m1', C=0.0003)
>>> model.LinearSVC(cv_type='kfold')
>>> model.LinearSVC(gridsearch={'C':[0.01, 0.02]}, cv_type='strat-kfold')
>>> model.LinearSVC(run=False) # Add model to the queue
LogisticRegression(cv_type=None, gridsearch=None, score='accuracy', model_name='log_reg', run=True, verbose=1, **kwargs)

Trains a logistic regression model.

For more Logistic Regression info, you can view them here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

If running grid search, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “log_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • penalty (str, ‘l1’, ‘l2’, ‘elasticnet’ or ‘none’, optional (default=’l2’)) – Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties. ‘elasticnet’ is only supported by the ‘saga’ solver. If ‘none’ (not supported by the liblinear solver), no regularization is applied.
  • tol (float, optional (default=1e-4)) – Tolerance for stopping criteria.
  • C (float, optional (default=1.0)) – Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
  • class_weight (dict or ‘balanced’, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.LogisticRegression()
>>> model.LogisticRegression(model_name='lg_1', C=0.001)
>>> model.LogisticRegression(cv_type='kfold')
>>> model.LogisticRegression(gridsearch={'C':[0.01, 0.02]}, cv_type='strat-kfold')
>>> model.LogisticRegression(run=False) # Add model to the queue
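The class_weight='balanced' formula quoted above, n_samples / (n_classes * np.bincount(y)), is easy to check by hand; a tiny numpy sketch with a made-up label vector:

>>> import numpy as np
>>> y = np.array([0, 0, 0, 1])
>>> len(y) / (2 * np.bincount(y))  # 2 classes, class counts [3, 1]
array([0.66666667, 2.        ])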
MultinomialClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='multi', run=True, verbose=1, **kwargs)

Trains a Multinomial Naive Bayes classification model.

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

For more Multinomial Naive Bayes info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB and https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “multi”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • alpha (float, optional (default=1.0)) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
  • fit_prior (boolean, optional (default=True)) – Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
  • class_prior (array-like, size (n_classes,), optional (default=None)) – Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.MultinomialClassification()
>>> model.MultinomialClassification(model_name='m1', alpha=0.0003)
>>> model.MultinomialClassification(cv_type='kfold')
>>> model.MultinomialClassification(gridsearch={'alpha':[0.01, 0.02]}, cv_type='strat-kfold')
>>> model.MultinomialClassification(run=False) # Add model to the queue
RandomForestClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='rf_cls', run=True, verbose=1, **kwargs)

Trains a Random Forest classification model.

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

For more Random Forest info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “rf_cls”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • n_estimators (integer, optional (default=10)) – The number of trees in the forest.
  • criterion (string, optional (default=”gini”)) –

    The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

    Note: this parameter is tree-specific.

  • max_depth (integer or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  • min_samples_split (int, float, optional (default=2)) –

    The minimum number of samples required to split an internal node:

    If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
  • min_samples_leaf (int, float, optional (default=1)) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
  • max_features (int, float, string or None, optional (default=”auto”)) –

    The number of features to consider when looking for the best split:

    If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features) (same as “auto”). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

  • max_leaf_nodes (int or None, optional (default=None)) – Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
  • min_impurity_decrease (float, optional (default=0.)) –

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

    The weighted impurity decrease equation is the following:

    N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

    N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed. (A short numeric sketch of this computation appears after the Examples below.)

  • bootstrap (boolean, optional (default=True)) – Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
  • oob_score (bool (default=False)) – Whether to use out-of-bag samples to estimate the generalization accuracy.
  • class_weight (dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None)) –

    Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y. Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}]. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)) The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown. For multi-output, the weights of each column of y will be multiplied.

    Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

  • ccp_alpha (non-negative float, optional (default=0.0)) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.RandomForestClassification()
>>> model.RandomForestClassification(model_name='m1', n_estimators=100)
>>> model.RandomForestClassification(cv_type='kfold')
>>> model.RandomForestClassification(gridsearch={'n_estimators':[100, 200]}, cv_type='strat-kfold')
>>> model.RandomForestClassification(run=False) # Add model to the queue
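
The weighted impurity decrease used by min_impurity_decrease is easier to follow numerically. The helper below is a minimal, illustrative sketch only (not part of the aethos API); the node sizes and Gini impurities are hypothetical.

import numpy as np  # not strictly needed here, kept for consistency with later sketches

# Illustrative only: evaluates the weighted impurity decrease that
# min_impurity_decrease is compared against (not an aethos function).
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    # N: total samples, N_t: samples at the node,
    # N_t_L / N_t_R: samples in the left / right child
    return (N_t / N) * (impurity
                        - (N_t_R / N_t) * right_impurity
                        - (N_t_L / N_t) * left_impurity)

# Hypothetical split: 200 of 1000 samples at the node (Gini 0.48),
# children of 120 and 80 samples with Gini 0.30 and 0.10.
decrease = weighted_impurity_decrease(N=1000, N_t=200, N_t_L=120, N_t_R=80,
                                      impurity=0.48, left_impurity=0.30,
                                      right_impurity=0.10)
print(round(decrease, 4))  # 0.052 -> the split is kept if min_impurity_decrease <= 0.052
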
RidgeClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='ridge_cls', run=True, verbose=1, **kwargs)

Trains a Ridge Classification model.

For more Ridge Regression parameters, you can view them here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “ridge_cls”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once) , by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • alpha (float) – Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to C^-1 in other linear models such as LogisticRegression or LinearSVC.
  • fit_intercept (boolean) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. data is expected to be already centered).
  • normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
  • tol (float, optional (default=1e-4)) – Tolerance for stopping criteria.
  • class_weight (dict or ‘balanced’, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.RidgeClassification()
>>> model.RidgeClassification(model_name='rc_1', tol=0.001)
>>> model.RidgeClassification(cv_type='kfold')
>>> model.RidgeClassification(gridsearch={'alpha':[0.01, 0.02]}, cv_type='strat-kfold')
>>> model.RidgeClassification(run=False) # Add model to the queue
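
For readers who want to see how keyword arguments such as alpha and class_weight behave, here is a plain scikit-learn sketch of the estimator this method appears to wrap (illustrative only, not the aethos API; the data is synthetic).

from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Larger alpha -> stronger regularization; 'balanced' reweights rare classes
clf = RidgeClassifier(alpha=1.0, class_weight='balanced', tol=1e-4)
clf.fit(X, y)
print(clf.score(X, y))
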
SGDClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='sgd_cls', run=True, verbose=1, **kwargs)

Trains a Linear classifier (SVM, logistic regression, a.o.) with SGD training.

For more info please view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “sgd_cls”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once) , by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • loss (str, default: ‘hinge’) –

    The loss function to be used. Defaults to ‘hinge’, which gives a linear SVM. The possible options are ‘hinge’, ‘log’, ‘modified_huber’, ‘squared_hinge’, ‘perceptron’, or a regression loss: ‘squared_loss’, ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’.

    The ‘log’ loss gives logistic regression, a probabilistic classifier. ‘modified_huber’ is another smooth loss that brings tolerance to outliers as well as probability estimates. ‘squared_hinge’ is like hinge but is quadratically penalized. ‘perceptron’ is the linear loss used by the perceptron algorithm. The other losses are designed for regression but can be useful in classification as well; see SGDRegressor for a description.

  • penalty (str, ‘none’, ‘l2’, ‘l1’, or ‘elasticnet’) – The penalty (aka regularization term) to be used. Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’.
  • alpha (float) – Constant that multiplies the regularization term. Defaults to 0.0001 Also used to compute learning_rate when set to ‘optimal’.
  • l1_ratio (float) – The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.
  • fit_intercept (bool) – Whether the intercept should be estimated or not. If False, the data is assumed to be already centered. Defaults to True.
  • max_iter (int, optional (default=1000)) – The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit.
  • tol (float or None, optional (default=1e-3)) – The stopping criterion. If it is not None, the iterations will stop when (loss > best_loss - tol) for n_iter_no_change consecutive epochs.
  • shuffle (bool, optional) – Whether or not the training data should be shuffled after each epoch. Defaults to True.
  • epsilon (float) – Epsilon in the epsilon-insensitive loss functions; only if loss is ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’. For ‘huber’, determines the threshold at which it becomes less important to get the prediction exactly right. For epsilon-insensitive, any differences between the current prediction and the correct label are ignored if they are less than this threshold.
  • learning_rate (string, optional) –

    The learning rate schedule:

    ‘constant’:

    eta = eta0

    ‘optimal’: [default]

    eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.

    ‘invscaling’:

    eta = eta0 / pow(t, power_t)

    ‘adaptive’:

    eta = eta0, as long as the training keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.
  • eta0 (double) – The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’ schedules. The default value is 0.0 as eta0 is not used by the default schedule ‘optimal’.
  • power_t (double) – The exponent for inverse scaling learning rate [default 0.5].
  • early_stopping (bool, default=False) – Whether to use early stopping to terminate training when validation score is not improving. If set to True, it will automatically set aside a stratified fraction of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.
  • validation_fraction (float, default=0.1) – The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.
  • n_iter_no_change (int, default=5) – Number of iterations with no improvement to wait before early stopping.
  • class_weight (dict, {class_label: weight} or “balanced” or None, optional) –

    Preset for the class_weight fit parameter.

    Weights associated with classes. If not given, all classes are supposed to have weight one.

    The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

  • average (bool or int, optional) – When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples.
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.SGDClassification()
>>> model.SGDClassification(model_name='rc_1', tol=0.001)
>>> model.SGDClassification(cv_type='kfold')
>>> model.SGDClassification(gridsearch={'alpha':[0.01, 0.02]}, cv_type='strat-kfold')
>>> model.SGDClassification(run=False) # Add model to the queue
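
The learning_rate and early_stopping parameters above interact: ‘adaptive’ keeps eta0 until progress stalls, while early_stopping holds out validation_fraction of the training data. A plain scikit-learn sketch (illustrative only, not the aethos API):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

clf = SGDClassifier(
    loss='hinge',              # linear SVM, the documented default
    penalty='elasticnet',
    alpha=1e-4,
    l1_ratio=0.15,
    learning_rate='adaptive',  # eta stays at eta0 until progress stalls, then is divided by 5
    eta0=0.01,                 # required for 'constant', 'invscaling' and 'adaptive'
    early_stopping=True,       # hold out validation_fraction of the training data
    validation_fraction=0.1,
    n_iter_no_change=5,
    random_state=0,
)
clf.fit(X, y)
print(clf.n_iter_)             # epochs actually run before stopping
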
SVC(cv_type=None, gridsearch=None, score='accuracy', model_name='svc_cls', run=True, verbose=1, **kwargs)

Trains a C-Support Vector classification model.

Supports multi classification.

The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using model.LinearSVC or model.SGDClassification instead.

The multiclass support is handled according to a one-vs-one scheme.

For more Support Vector info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “svc_cls”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once) , by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • C (float, optional (default=1.0)) – Penalty parameter C of the error term.
  • kernel (string, optional (default=’rbf’)) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
  • degree (int, optional (default=3)) – Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
  • gamma (float, optional (default=’auto’)) – Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. Current default is ‘auto’ which uses 1 / n_features, if gamma=’scale’ is passed then it uses 1 / (n_features * X.var()) as value of gamma.
  • coef0 (float, optional (default=0.0)) – Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
  • shrinking (boolean, optional (default=True)) – Whether to use the shrinking heuristic.
  • probability (boolean, optional (default=False)) – Whether to enable probability estimates. This must be enabled prior to calling fit, and will slow down that method.
  • tol (float, optional (default=1e-3)) – Tolerance for stopping criterion.
  • cache_size (float, optional) – Specify the size of the kernel cache (in MB).
  • class_weight ({dict, ‘balanced’}, optional) – Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
  • max_iter (int, optional (default=-1)) – Hard limit on iterations within solver, or -1 for no limit.
  • decision_function_shape (‘ovo’, ‘ovr’, default=’ovr’) – Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one (‘ovo’) decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2). However, one-vs-one (‘ovo’) is always used as multi-class strategy. (See the sketch after the Examples below.)
Returns:

ClassificationModelAnalysis object to view results and analyze results

Return type:

ClassificationModelAnalysis

Examples

>>> model.SVC()
>>> model.SVC(model_name='m1', C=0.0003)
>>> model.SVC(cv_type='kfold')
>>> model.SVC(gridsearch={'C':[0.01, 0.02]}, cv_type='strat-kfold')
>>> model.SVC(run=False) # Add model to the queue
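
The decision_function_shape parameter only changes the shape of the reported decision values (training is always one-vs-one). A plain scikit-learn sketch on a hypothetical 4-class problem (illustrative only, not the aethos API):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 4-class data
X, y = make_classification(n_samples=300, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)

ovr = SVC(kernel='rbf', decision_function_shape='ovr').fit(X, y)
ovo = SVC(kernel='rbf', decision_function_shape='ovo').fit(X, y)

print(ovr.decision_function(X).shape)  # (300, 4): one column per class
print(ovo.decision_function(X).shape)  # (300, 6): n_classes * (n_classes - 1) / 2 pairwise columns
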
XGBoostClassification(cv_type=None, gridsearch=None, score='accuracy', model_name='xgb_cls', run=True, verbose=1, **kwargs)

Trains an XGBoost Classification Model.

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

For more XGBoost info, you can view it here: https://xgboost.readthedocs.io/en/latest/ and https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst.

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
  • model_name (str, optional) – Name for this model, by default “xgb_cls”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once) , by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • max_depth (int) – Maximum tree depth for base learners. By default 3
  • learning_rate (float) – Boosting learning rate (xgb’s “eta”). By default 0.1
  • n_estimators (int) – Number of trees to fit. By default 100.
  • objective (string or callable) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). By default binary:logistic for binary classification or multi:softprob for multiclass classification
  • booster (string) – Specify which booster to use: gbtree, gblinear or dart. By default ‘gbtree’
  • tree_method (string) – Specify which tree method to use. If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option in the parameters document. By default ‘auto’
  • gamma (float) – Minimum loss reduction required to make a further partition on a leaf node of the tree. By default 0
  • subsample (float) – Subsample ratio of the training instance. By default 1
  • reg_alpha (float (xgb's alpha)) – L1 regularization term on weights. By default 0
  • reg_lambda (float (xgb's lambda)) – L2 regularization term on weights. By default 1
  • scale_pos_weight (float) – Balancing of positive and negative weights. By default 1
  • base_score – The initial prediction score of all instances, global bias. By default 0
  • missing (float, optional) – Value in the data which needs to be present as a missing value. If None, defaults to np.nan. By default, None
  • num_parallel_tree (int) – Used for boosting random forest. By default 1
  • importance_type (string, default "gain") – The feature importance type for the feature_importances_ property: either “gain”, “weight”, “cover”, “total_gain” or “total_cover”. By default ‘gain’.

Note

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

y_true: array_like of shape [n_samples]
The target values
y_pred: array_like of shape [n_samples]
The predicted values
grad: array_like of shape [n_samples]
The value of the gradient for each sample point.
hess: array_like of shape [n_samples]
The value of the second derivative for each sample point
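
As an illustration of that signature, here is a minimal sketch of a custom binary logistic objective; it assumes y_pred are raw (untransformed) scores and simply mirrors the built-in binary:logistic, so it is for demonstration only.

import numpy as np

def custom_logistic(y_true, y_pred):
    # Gradient and hessian of the log loss with respect to the raw score
    prob = 1.0 / (1.0 + np.exp(-y_pred))  # sigmoid of the raw margin
    grad = prob - y_true                  # first derivative per sample
    hess = prob * (1.0 - prob)            # second derivative per sample
    return grad, hess

# Hypothetical usage, forwarding the callable through **kwargs:
# >>> model.XGBoostClassification(objective=custom_logistic)
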
Returns:ClassificationModelAnalysis object to view results and analyze results
Return type:ClassificationModelAnalysis

Examples

>>> model.XGBoostClassification()
>>> model.XGBoostClassification(model_name='m1', reg_alpha=0.0003)
>>> model.XGBoostClassification(cv_type='kfold')
>>> model.XGBoostClassification(gridsearch={'reg_alpha':[0.01, 0.02]}, cv_type='strat-kfold')
>>> model.XGBoostClassification(run=False) # Add model to the queue
class aethos.modelling.regression_models.Regression(x_train, target, x_test=None, test_split_percentage=0.2, exp_name='my-experiment')

Bases: aethos.modelling.model.ModelBase, aethos.analysis.Analysis, aethos.cleaning.clean.Clean, aethos.preprocessing.preprocess.Preprocess, aethos.feature_engineering.feature.Feature, aethos.visualizations.visualizations.Visualizations, aethos.stats.stats.Stats

ADABoostRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='ada_reg', run=True, verbose=1, **kwargs)

Trains an AdaBoost Regression model.

An AdaBoost regressor is a meta-estimator that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset, but where the weights of instances are adjusted according to the error of the current prediction such that subsequent regressors focus more on difficult cases.

For more AdaBoost info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html#sklearn.ensemble.AdaBoostRegressor

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “ada_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once) , by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • base_estimator (object, optional (default=None)) – The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, then the base estimator is DecisionTreeRegressor(max_depth=3)
  • n_estimators (integer, optional (default=50)) – The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.
  • learning_rate (float, optional (default=1.)) – Learning rate shrinks the contribution of each classifier by learning_rate. There is a trade-off between learning_rate and n_estimators.
  • loss ({‘linear’, ‘square’, ‘exponential’}, optional (default=’linear’)) – The loss function to use when updating the weights after each boosting iteration.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.AdaBoostRegression()
>>> model.AdaBoostRegression(model_name='m1', learning_rate=0.0003)
>>> model.AdaBoostRegression(cv=10)
>>> model.AdaBoostRegression(gridsearch={'learning_rate':[0.01, 0.02]}, cv='strat-kfold')
>>> model.AdaBoostRegression(run=False) # Add model to the queue
BaggingRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='bag_reg', run=True, verbose=1, **kwargs)

Trains a Bagging Regressor model.

A Bagging regressor is an ensemble meta-estimator that fits base regressors each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.

For more Bagging Regressor info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “bag_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once) , by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • base_estimator (object or None, optional (default=None)) – The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.
  • n_estimators (int, optional (default=10)) – The number of base estimators in the ensemble.
  • max_samples (int or float, optional (default=1.0)) –

    The number of samples to draw from X to train each base estimator.

    If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples.
  • max_features (int or float, optional (default=1.0)) –

    The number of features to draw from X to train each base estimator.

    If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.
  • bootstrap (boolean, optional (default=True)) – Whether samples are drawn with replacement. If False, sampling without replacement is performed.
  • bootstrap_features (boolean, optional (default=False)) – Whether features are drawn with replacement.
  • oob_score (bool, optional (default=False)) – Whether to use out-of-bag samples to estimate the generalization error.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.BaggingRegression()
>>> model.BaggingRegression(model_name='m1', n_estimators=100)
>>> model.BaggingRegression(cv=10)
>>> model.BaggingRegression(gridsearch={'n_estimators':[100, 200]}, cv='strat-kfold')
>>> model.BaggingRegression(run=False) # Add model to the queue
BayesianRidgeRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='bayridge_reg', run=True, verbose=1, **kwargs)

Trains a Bayesian Ridge Regression model.

For more Linear Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge and https://scikit-learn.org/stable/modules/linear_model.html#bayesian-regression

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “bayridge_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • n_iter (int, optional) – Maximum number of iterations. Default is 300. Should be greater than or equal to 1.
  • tol (float, optional) – Stop the algorithm if w has converged. Default is 1.e-3.
  • alpha_1 (float, optional) – Hyper-parameter: shape parameter for the Gamma distribution prior over the alpha parameter. Default is 1.e-6.
  • alpha_2 (float, optional) – Hyper-parameter: inverse scale parameter (rate parameter) for the Gamma distribution prior over the alpha parameter. Default is 1.e-6.
  • lambda_1 (float, optional) – Hyper-parameter: shape parameter for the Gamma distribution prior over the lambda parameter. Default is 1.e-6.
  • lambda_2 (float, optional) – Hyper-parameter: inverse scale parameter (rate parameter) for the Gamma distribution prior over the lambda parameter. Default is 1.e-6.
  • fit_intercept (boolean, optional, default True) – Whether to calculate the intercept for this model. The intercept is not treated as a probabilistic parameter and thus has no associated variance. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
  • normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
Returns:RegressionModelAnalysis object to view results and analyze results
Return type:RegressionModelAnalysis

Examples

>>> model.BayesianRidgeRegression()
>>> model.BayesianRidgeRegression(model_name='m1', alpha_1=0.0003)
>>> model.BayesianRidgeRegression(cv=10)
>>> model.BayesianRidgeRegression(gridsearch={'alpha_2':[0.01, 0.02]}, cv='strat-kfold')
>>> model.BayesianRidgeRegression(run=False) # Add model to the queue
DecisionTreeRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='dt_reg', run=True, verbose=1, **kwargs)

Trains a Decision Tree Regression model.

For more Decision Tree info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “dt_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once) , by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • criterion (string, optional (default=”mse”)) –

    The function to measure the quality of a split.

    Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion and minimizes the L2 loss using the mean of each terminal node,
    “friedman_mse”, which uses mean squared error with Friedman’s improvement score for potential splits, and “mae” for the mean absolute error, which minimizes the L1 loss using the median of each terminal node.
  • splitter (string, optional (default=”best”)) – The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
  • max_depth (int or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  • min_samples_split (int, float, optional (default=2)) –

    The minimum number of samples required to split an internal node:

    If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
  • min_samples_leaf (int, float, optional (default=1)) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
  • max_features (int, float, string or None, optional (default=None)) –

    The number of features to consider when looking for the best split:

    If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

  • max_leaf_nodes (int or None, optional (default=None)) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
  • min_impurity_decrease (float, optional (default=0.)) –

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

    The weighted impurity decrease equation is the following:

    N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

    N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

  • min_impurity_split (float, (default=1e-7)) – Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
  • presort (bool, optional (default=False)) – Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to true may slow down the training process. When using either a smaller dataset or a restricted depth, this may speed up the training.
  • ccp_alpha (non-negative float, optional (default=0.0)) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.DecisionTreeRegression()
>>> model.DecisionTreeRegression(model_name='m1', min_impurity_split=0.0003)
>>> model.DecisionTreeRegression(cv=10)
>>> model.DecisionTreeRegression(gridsearch={'min_impurity_split':[0.01, 0.02]}, cv='strat-kfold')
>>> model.DecisionTreeRegression(run=False) # Add model to the queue
ElasticnetRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='elastic', run=True, verbose=1, **kwargs)

Elastic Net regression with combined L1 and L2 priors as regularizer.

For more Linear Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “elastic”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once) , by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • alpha (float, optional) – Constant that multiplies the penalty terms. Defaults to 1.0. See the notes for the exact mathematical meaning of this parameter. alpha = 0 is equivalent to an ordinary least square, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.
  • l1_ratio (float) – The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. (A short numeric sketch appears after the Examples below.)
  • fit_intercept (bool) – Whether the intercept should be estimated or not. If False, the data is assumed to be already centered.
  • normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.
  • precompute (True | False | array-like) – Whether to use a precomputed Gram matrix to speed up calculations. The Gram matrix can also be passed as argument. For sparse input this option is always True to preserve sparsity.
  • max_iter (int, optional) – The maximum number of iterations
  • tol (float, optional) – The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
  • positive (bool, optional) – When set to True, forces the coefficients to be positive.
  • selection (str, default ‘cyclic’) – If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.ElasticnetRegression()
>>> model.ElasticnetRegression(model_name='m1', alpha=0.0003)
>>> model.ElasticnetRegression(cv=10)
>>> model.ElasticnetRegression(gridsearch={'alpha':[0.01, 0.02]}, cv='strat-kfold')
>>> model.ElasticnetRegression(run=False) # Add model to the queue
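
How alpha and l1_ratio combine can be read off the elastic-net penalty term; the short numpy sketch below evaluates it for a toy coefficient vector (illustrative only; the expression follows scikit-learn's documented ElasticNet objective).

import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    # Penalty added to the least-squares loss:
    # alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2

w = np.array([0.5, -1.0, 2.0])
print(elastic_net_penalty(w, l1_ratio=1.0))  # 3.5   -> pure L1 (Lasso-like)
print(elastic_net_penalty(w, l1_ratio=0.0))  # 2.625 -> pure L2 (Ridge-like)
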
GradientBoostingRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='grad_reg', run=True, verbose=1, **kwargs)

Trains a Gradient Boosting regression model.

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.

For more Gradient Boosting Regressor info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “grad_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once) , by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • loss ({‘ls’, ‘lad’, ‘huber’, ‘quantile’}, optional (default=’ls’)) –

    loss function to be optimized.

    ‘ls’ refers to least squares regression. ‘lad’ (least absolute deviation) is a highly robust loss function solely based on order information of the input variables. ‘huber’ is a combination of the two. ‘quantile’ allows quantile regression (use alpha to specify the quantile).

  • learning_rate (float, optional (default=0.1)) – learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.
  • n_estimators (int (default=100)) – The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance.
  • subsample (float, optional (default=1.0)) – The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. Subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
  • criterion (string, optional (default=”friedman_mse”)) – The function to measure the quality of a split. Supported criteria are “friedman_mse” for the mean squared error with improvement score by Friedman, “mse” for mean squared error, and “mae” for the mean absolute error. The default value of “friedman_mse” is generally the best as it can provide a better approximation in some cases.
  • min_samples_split (int, float, optional (default=2)) –

    The minimum number of samples required to split an internal node:

    If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
  • min_samples_leaf (int, float, optional (default=1)) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
  • max_depth (integer, optional (default=3)) – maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
  • max_features (int, float, string or None, optional (default=None)) –

    The number of features to consider when looking for the best split:

    If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.

    Choosing max_features < n_features leads to a reduction of variance and an increase in bias.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

  • alpha (float (default=0.9)) – The alpha-quantile of the huber loss function and the quantile loss function. Only if loss=’huber’ or loss=’quantile’. (See the sketch after the Examples below.)
  • max_leaf_nodes (int or None, optional (default=None)) – Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
  • presort (bool or ‘auto’, optional (default=’auto’)) – Whether to presort the data to speed up the finding of best splits in fitting. Auto mode by default will use presorting on dense data and default to normal sorting on sparse data. Setting presort to true on sparse data will raise an error.
  • validation_fraction (float, optional, default 0.1) – The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if n_iter_no_change is set to an integer.
  • tol (float, optional, default 1e-4) – Tolerance for the early stopping. When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.GradientBoostingRegression()
>>> model.GradientBoostingRegression(model_name='m1', alpha=0.0003)
>>> model.GradientBoostingRegression(cv=10)
>>> model.GradientBoostingRegression(gridsearch={'alpha':[0.01, 0.02]}, cv='strat-kfold')
>>> model.GradientBoostingRegression(run=False) # Add model to the queue
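
Since the alpha parameter only applies to the ‘huber’ and ‘quantile’ losses, here is a plain scikit-learn sketch of quantile gradient boosting used to form a rough prediction interval (illustrative only, not the aethos API; the data is synthetic).

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=500)

# One model per quantile: alpha is the quantile being estimated.
lower = GradientBoostingRegressor(loss='quantile', alpha=0.1, n_estimators=100).fit(X, y)
upper = GradientBoostingRegressor(loss='quantile', alpha=0.9, n_estimators=100).fit(X, y)

x_new = np.array([[5.0]])
print(lower.predict(x_new), upper.predict(x_new))  # rough 80% prediction interval at x = 5
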
LassoRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='lasso', run=True, verbose=1, **kwargs)

Lasso Regression Model trained with L1 prior as regularizer (aka the Lasso)

Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0 (no L2 penalty).

For more Lasso Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “lasso”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once) , by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • alpha (float, optional) – Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to an ordinary least square, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.
  • fit_intercept (boolean, optional, default True) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
  • normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
  • precompute (True | False | array-like, default=False) – Whether to use a precomputed Gram matrix to speed up calculations. If set to ‘auto’ let us decide. The Gram matrix can also be passed as argument. For sparse input this option is always True to preserve sparsity.
  • max_iter (int, optional) – The maximum number of iterations
  • tol (float, optional) –
    The tolerance for the optimization:
    if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
  • positive (bool, optional) – When set to True, forces the coefficients to be positive.
  • selection (str, default ‘cyclic’) – If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.LassoRegression()
>>> model.LassoRegression(model_name='m1', alpha=0.0003)
>>> model.LassoRegression(cv=10)
>>> model.LassoRegression(gridsearch={'alpha':[0.01, 0.02]}, cv='strat-kfold')
>>> model.LassoRegression(run=False) # Add model to the queue
LightGBMRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='lgbm_reg', run=True, verbose=1, **kwargs)

Trains a LightGBM Regression Model.

LightGBM is a gradient boosting framework that uses a tree based learning algorithm.

LightGBM grows trees leaf-wise (vertically) while most other algorithms grow level-wise (horizontally). It chooses the leaf with the maximum delta loss to grow; when growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.

For more LightGBM info, you can view it here: https://github.com/microsoft/LightGBM and https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html#lightgbm.LGBMRegressor

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “lgbm_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once) , by default False
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • boosting_type (string, optional (default='gbdt')) – ‘gbdt’, traditional Gradient Boosting Decision Tree. ‘dart’, Dropouts meet Multiple Additive Regression Trees. ‘goss’, Gradient-based One-Side Sampling. ‘rf’, Random Forest.
  • num_leaves (int, optional (default=31)) – Maximum tree leaves for base learners.
  • max_depth (int, optional (default=-1)) – Maximum tree depth for base learners, <=0 means no limit.
  • learning_rate (float, optional (default=0.1)) – Boosting learning rate. You can use callbacks parameter of fit method to shrink/adapt learning rate in training using reset_parameter callback. Note, that this will ignore the learning_rate argument in training.
  • n_estimators (int, optional (default=100)) – Number of boosted trees to fit.
  • subsample_for_bin (int, optional (default=200000)) – Number of samples for constructing bins.
  • objective (string, callable or None, optional (default=None)) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). Default: ‘regression’ for LGBMRegressor, ‘binary’ or ‘multiclass’ for LGBMClassifier, ‘lambdarank’ for LGBMRanker.
  • class_weight (dict, 'balanced' or None, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. Use this parameter only for multi-class classification task; for binary classification task you may use is_unbalance or scale_pos_weight parameters. Note, that the usage of all these parameters will result in poor estimates of the individual class probabilities. You may want to consider performing probability calibration (https://scikit-learn.org/stable/modules/calibration.html) of your model. The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). If None, all classes are supposed to have weight one. Note, that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
  • min_split_gain (float, optional (default=0.)) – Minimum loss reduction required to make a further partition on a leaf node of the tree.
  • min_child_weight (float, optional (default=1e-3)) – Minimum sum of instance weight (hessian) needed in a child (leaf).
  • min_child_samples (int, optional (default=20)) – Minimum number of data needed in a child (leaf).
  • subsample (float, optional (default=1.)) – Subsample ratio of the training instance.
  • subsample_freq (int, optional (default=0)) – Frequency of subsampling; <=0 means disabled.
  • colsample_bytree (float, optional (default=1.)) – Subsample ratio of columns when constructing each tree.
  • reg_alpha (float, optional (default=0.)) – L1 regularization term on weights.
  • reg_lambda (float, optional (default=0.)) – L2 regularization term on weights.
  • random_state (int or None, optional (default=None)) – Random number seed. If None, default seeds in C++ code will be used.
  • n_jobs (int, optional (default=-1)) – Number of parallel threads.
  • silent (bool, optional (default=True)) – Whether to print messages while running boosting.
  • importance_type (string, optional (default='split')) – The type of feature importance to be filled into feature_importances_. If ‘split’, result contains numbers of times the feature is used in a model. If ‘gain’, result contains total gains of splits which use the feature.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.LightGBMRegression()
>>> model.LightGBMRegression(model_name='m1', reg_lambda=0.0003)
>>> model.LightGBMRegression(cv=10)
>>> model.LightGBMRegression(gridsearch={'reg_lambda':[0.01, 0.02]}, cv='strat-kfold')
>>> model.LightGBMRegression(run=False) # Add model to the queue
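
For readers who want to see the leaf-wise growth parameters on the underlying estimator, here is a plain LightGBM sketch (illustrative only, not the aethos API; keyword arguments such as num_leaves appear to be forwarded to this constructor, and the data is synthetic).

import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 8))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

reg = LGBMRegressor(
    boosting_type='gbdt',
    num_leaves=31,        # leaf-wise growth is capped by leaves, not depth
    max_depth=-1,         # <=0 means no depth limit
    learning_rate=0.1,
    n_estimators=100,
)
reg.fit(X, y)
print(reg.predict(X[:3]))
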
LinearRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='lin_reg', run=True, verbose=1, **kwargs)

Trains a Linear Regression.

For more Linear Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv_type ({kfold, strat-kfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
  • gridsearch (dict, optional) – Parameters to gridsearch, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “lin_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • fit_intercept (boolean, optional, default True) – whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
  • normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.LinearRegression()
>>> model.LinearRegression(model_name='m1', normalize=True)
>>> model.LinearRegression(cv=10)
>>> model.LinearRegression(gridsearch={'normalize':[True, False]}, cv='strat-kfold')
>>> model.LinearRegression(run=False) # Add model to the queue
LinearSVR(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='linsvr', run=True, verbose=1, **kwargs)

Trains a Linear Support Vector Regression model.

Similar to SVR with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

For more Support Vector info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv (int, str or Crossvalidation Generator, optional) – Cross-validation method: the number of folds, ‘kfold’, ‘strat-kfold’ or a cross-validation generator, by default None
  • gridsearch (dict, optional) – Dictionary mapping hyperparameter names to lists of values to grid search over, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “linsvr”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • epsilon (float, optional (default=0.0)) – Epsilon parameter in the epsilon-insensitive loss function. Note that the value of this parameter depends on the scale of the target variable y. If unsure, set epsilon=0.
  • tol (float, optional (default=1e-4)) – Tolerance for stopping criteria.
  • C (float, optional (default=1.0)) – Penalty parameter C of the error term.
  • loss (string, ‘epsilon_insensitive’ or ‘squared_epsilon_insensitive’ (default=’epsilon_insensitive’)) – Specifies the loss function. The epsilon-insensitive loss (standard SVR) is the L1 loss, while the squared epsilon-insensitive loss is the L2 loss.
  • dual (bool, (default=True)) – Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.
  • fit_intercept (boolean, optional (default=True)) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).
  • intercept_scaling (float, optional (default=1)) – When self.fit_intercept is True, instance vector x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equals to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight Note! the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on synthetic feature weight (and therefore on the intercept) intercept_scaling has to be increased.
  • max_iter (int, (default=1000)) – The maximum number of iterations to be run.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.LinearSVR()
>>> model.LinearSVR(model_name='m1', C=0.0003)
>>> model.LinearSVR(cv=10)
>>> model.LinearSVR(gridsearch={'C':[0.01, 0.02]}, cv='strat-kfold')
>>> model.LinearSVR(run=False) # Add model to the queue
RandomForestRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='rf_reg', run=True, verbose=1, **kwargs)

Trains a Random Forest Regression model.

A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

For more Random Forest info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv (int, str or Crossvalidation Generator, optional) – Cross-validation method: the number of folds, ‘kfold’, ‘strat-kfold’ or a cross-validation generator, by default None
  • gridsearch (dict, optional) – Dictionary mapping hyperparameter names to lists of values to grid search over, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “rf_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • n_estimators (integer, optional (default=10)) – The number of trees in the forest.
  • criterion (string, optional (default=”mse”)) – The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion, and “mae” for the mean absolute error.
  • max_depth (integer or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
  • min_samples_split (int, float, optional (default=2)) –

    The minimum number of samples required to split an internal node:

    If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
  • min_samples_leaf (int, float, optional (default=1)) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
  • max_features (int, float, string or None, optional (default=”auto”)) –

    The number of features to consider when looking for the best split:

    If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features) (same as “auto”). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.

    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

  • max_leaf_nodes (int or None, optional (default=None)) – Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
  • min_impurity_decrease (float, optional (default=0.)) –

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

    The weighted impurity decrease equation is the following:

    N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

    N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

  • bootstrap (boolean, optional (default=True)) – Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
  • oob_score (bool (default=False)) – Whether to use out-of-bag samples to estimate the generalization accuracy.
  • ccp_alpha (non-negative float, optional (default=0.0)) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.RandomForestRegression()
>>> model.RandomForestRegression(model_name='m1', n_estimators=100)
>>> model.RandomForestRegression(cv=10)
>>> model.RandomForestRegression(gridsearch={'n_estimators':[100, 200]}, cv='strat-kfold')
>>> model.RandomForestRegression(run=False) # Add model to the queue
RidgeRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='ridge_reg', run=True, verbose=1, **kwargs)

Trains a Ridge Regression model.

For more Ridge Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv (int, str or Crossvalidation Generator, optional) – Cross-validation method: the number of folds, ‘kfold’, ‘strat-kfold’ or a cross-validation generator, by default None
  • gridsearch (dict, optional) – Dictionary mapping hyperparameter names to lists of values to grid search over, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “ridge_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • alpha ({float, array-like}, shape (n_targets)) – Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to C^-1 in other linear models such as LogisticRegression or LinearSVC. If an array is passed, penalties are assumed to be specific to the targets. Hence they must correspond in number.
  • fit_intercept (boolean) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. data is expected to be already centered).
  • normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
  • max_iter (int, optional) – Maximum number of iterations for conjugate gradient solver.
  • tol (float) – Precision of the solution.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.RidgeRegression()
>>> model.RidgeRegression(model_name='m1', alpha=0.0003)
>>> model.RidgeRegression(cv=10)
>>> model.RidgeRegression(gridsearch={'alpha':[0.01, 0.02]}, cv='strat-kfold')
>>> model.RidgeRegression(run=False) # Add model to the queue
SGDRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='sgd_reg', run=True, verbose=1, **kwargs)

Trains a SGD Regression model.

Linear model fitted by minimizing a regularized empirical loss with SGD

SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate).

The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.

For more SGD Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv (int, str or Crossvalidation Generator, optional) – Cross-validation method: the number of folds, ‘kfold’, ‘strat-kfold’ or a cross-validation generator, by default None
  • gridsearch (dict, optional) – Dictionary mapping hyperparameter names to lists of values to grid search over, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “sgd_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • loss (str, default: ‘squared_loss’) –

    The loss function to be used.

    The possible values are ‘squared_loss’, ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’

    The ‘squared_loss’ refers to the ordinary least squares fit. ‘huber’ modifies ‘squared_loss’ to focus less on getting outliers correct by switching from squared to linear loss past a distance of epsilon. ‘epsilon_insensitive’ ignores errors less than epsilon and is linear past that; this is the loss function used in SVR. ‘squared_epsilon_insensitive’ is the same but becomes squared loss past a tolerance of epsilon.

  • penalty (str, ‘none’, ‘l2’, ‘l1’, or ‘elasticnet’) – The penalty (aka regularization term) to be used. Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’.
  • alpha (float) – Constant that multiplies the regularization term. Defaults to 0.0001. Also used to compute learning_rate when set to ‘optimal’.
  • l1_ratio (float) – The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.
  • fit_intercept (bool) – Whether the intercept should be estimated or not. If False, the data is assumed to be already centered. Defaults to True.
  • max_iter (int, optional (default=1000)) – The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit.
  • tol (float or None, optional (default=1e-3)) – The stopping criterion. If it is not None, the iterations will stop when (loss > best_loss - tol) for n_iter_no_change consecutive epochs.
  • shuffle (bool, optional) – Whether or not the training data should be shuffled after each epoch. Defaults to True.
  • epsilon (float) –

    Epsilon in the epsilon-insensitive loss functions; only if loss is ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’.

    For ‘huber’, determines the threshold at which it becomes less important to get the prediction exactly right. For epsilon-insensitive, any differences between the current prediction and the correct label are ignored if they are less than this threshold.

  • learning_rate (string, optional) –

    The learning rate schedule:

    ‘constant’:
    eta = eta0
    ‘optimal’:
    eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.
    ‘invscaling’: [default]
    eta = eta0 / pow(t, power_t)
    ‘adaptive’:
    eta = eta0, as long as the training keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.
  • eta0 (double) – The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’ schedules. The default value is 0.01.
  • power_t (double) – The exponent for inverse scaling learning rate [default 0.5].
  • early_stopping (bool, default=False) – Whether to use early stopping to terminate training when validation score is not improving. If set to True, it will automatically set aside a fraction of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.
  • validation_fraction (float, default=0.1) – The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.
  • n_iter_no_change (int, default=5) – Number of iterations with no improvement to wait before early stopping.
  • average (bool or int, optional) – When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.SGDRegression()
>>> model.SGDRegression(model_name='m1', alpha=0.0003)
>>> model.SGDRegression(cv=10)
>>> model.SGDRegression(gridsearch={'alpha':[0.01, 0.02]}, cv='strat-kfold')
>>> model.SGDRegression(run=False) # Add model to the queue
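
Since early_stopping, validation_fraction and n_iter_no_change work together (see the parameter notes above), an early-stopping run would look roughly like the following sketch; the values are purely illustrative.

>>> model.SGDRegression(early_stopping=True, validation_fraction=0.2, n_iter_no_change=10)
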
SVR(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='svr_reg', run=True, verbose=1, **kwargs)

Epsilon-Support Vector Regression.

The free parameters in the model are C and epsilon.

The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using model.LinearSVR or model.SGDRegression instead.

For more Support Vector info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv (int, str or Crossvalidation Generator, optional) – Cross-validation method: the number of folds, ‘kfold’, ‘strat-kfold’ or a cross-validation generator, by default None
  • gridsearch (dict, optional) – Dictionary mapping hyperparameter names to lists of values to grid search over, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “svr_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • kernel (string, optional (default=’rbf’)) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
  • degree (int, optional (default=3)) – Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
  • gamma (float, optional (default=’auto’)) – Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. Current default is ‘auto’ which uses 1 / n_features, if gamma=’scale’ is passed then it uses 1 / (n_features * X.var()) as value of gamma.
  • coef0 (float, optional (default=0.0)) – Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
  • tol (float, optional (default=1e-3)) – Tolerance for stopping criterion.
  • C (float, optional (default=1.0)) – Penalty parameter C of the error term.
  • epsilon (float, optional (default=0.1)) – Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.
  • shrinking (boolean, optional (default=True)) – Whether to use the shrinking heuristic.
  • cache_size (float, optional) – Specify the size of the kernel cache (in MB).
  • max_iter (int, optional (default=-1)) – Hard limit on iterations within solver, or -1 for no limit.
Returns:

RegressionModelAnalysis object to view results and analyze results

Return type:

RegressionModelAnalysis

Examples

>>> model.SVR()
>>> model.SVR(model_name='m1', C=0.0003)
>>> model.SVR(cv=10)
>>> model.SVR(gridsearch={'C':[0.01, 0.02]}, cv='strat-kfold')
>>> model.SVR(run=False) # Add model to the queue
XGBoostRegression(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='xgb_reg', run=True, verbose=1, **kwargs)

Trains an XGBoost Regression Model.

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

For more XGBoost info, you can view it here: https://xgboost.readthedocs.io/en/latest/ and https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst.

If running gridsearch, the implemented cross validators are:
  • ‘kfold’ for KFold
  • ‘strat-kfold’ for StratifiedKfold
Possible scoring metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
Parameters:
  • cv (int, str or Crossvalidation Generator, optional) – Cross-validation method: the number of folds, ‘kfold’, ‘strat-kfold’ or a cross-validation generator, by default None
  • gridsearch (dict, optional) – Dictionary mapping hyperparameter names to lists of values to grid search over, by default None
  • score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
  • model_name (str, optional) – Name for this model, by default “xgb_reg”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • max_depth (int) – Maximum tree depth for base learners. By default 3
  • learning_rate (float) – Boosting learning rate (xgb’s “eta”). By default 0.1
  • n_estimators (int) – Number of trees to fit. By default 100.
  • objective (string or callable) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). By default, reg:linear
  • booster (string) – Specify which booster to use: gbtree, gblinear or dart. By default ‘gbtree’
  • tree_method (string) – Specify which tree method to use If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from parameters document. By default ‘auto’
  • gamma (float) – Minimum loss reduction required to make a further partition on a leaf node of the tree. By default 0
  • subsample (float) – Subsample ratio of the training instance. By default 1
  • reg_alpha (float (xgb's alpha)) – L1 regularization term on weights. By default 0
  • reg_lambda (float (xgb's lambda)) – L2 regularization term on weights. By default 1
  • scale_pos_weight (float) – Balancing of positive and negative weights. By default 1
  • base_score – The initial prediction score of all instances, global bias. By default 0
  • missing (float, optional) – Value in the data which needs to be present as a missing value. If None, defaults to np.nan. By default, None
  • num_parallel_tree (int) – Used for boosting random forest. By default 1
  • importance_type (string, default "gain") – The feature importance type for the feature_importances_ property: either “gain”, “weight”, “cover”, “total_gain” or “total_cover”. By default ‘gain’.

Note

A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:

y_true: array_like of shape [n_samples]
The target values
y_pred: array_like of shape [n_samples]
The predicted values
grad: array_like of shape [n_samples]
The value of the gradient for each sample point.
hess: array_like of shape [n_samples]
The value of the second derivative for each sample point
Returns:RegressionModelAnalysis object to view results and analyze results
Return type:RegressionModelAnalysis

Examples

>>> model.XGBoostRegression()
>>> model.XGBoostRegression(model_name='m1', reg_alpha=0.0003)
>>> model.XGBoostRegression(cv=10)
>>> model.XGBoostRegression(gridsearch={'reg_alpha':[0.01, 0.02]}, cv='strat-kfold')
>>> model.XGBoostRegression(run=False) # Add model to the queue
class aethos.modelling.unsupervised_models.Unsupervised(x_train, exp_name='my-experiment')

Bases: aethos.modelling.model.ModelBase, aethos.analysis.Analysis, aethos.cleaning.clean.Clean, aethos.preprocessing.preprocess.Preprocess, aethos.feature_engineering.feature.Feature, aethos.visualizations.visualizations.Visualizations, aethos.stats.stats.Stats

AgglomerativeClustering(model_name='agglom', run=True, **kwargs)

Trains an Agglomerative Clustering model.

Each data point starts as its own cluster, and pairs of clusters are then successively merged (agglomerated) until all clusters have been merged into a single cluster that contains all data points.

Hierarchical clustering does not require us to specify the number of clusters and we can even select which number of clusters looks best since we are building a tree.

Additionally, the algorithm is not sensitive to the choice of distance metric; all of them tend to work equally well whereas with other clustering algorithms, the choice of distance metric is critical.

A particularly good use case of hierarchical clustering methods is when the underlying data has a hierarchical structure and you want to recover the hierarchy; other clustering algorithms can’t do this.

These advantages of hierarchical clustering come at the cost of lower efficiency, as it has a time complexity of O(n³), unlike the linear complexity of K-Means and GMM.

For a list of all possible options for Agglomerative clustering please visit: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering

Parameters:
  • model_name (str, optional) – Name for this model, by default “agglom”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • n_clusters (int or None, optional (default=2)) – The number of clusters to find. It must be None if distance_threshold is not None.
  • affinity (string or callable, default: “euclidean”) – Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method.
  • compute_full_tree (bool or ‘auto’ (optional)) – Stop early the construction of the tree at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None.
  • linkage ({“ward”, “complete”, “average”, “single”}, optional (default=”ward”)) –

    Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observation. The algorithm will merge the pairs of cluster that minimize this criterion.

    ’ward’ minimizes the variance of the clusters being merged. ‘average’ uses the average of the distances of each observation of the two sets. ‘complete’ or maximum linkage uses the maximum distances between all observations of the two sets. ‘single’ uses the minimum of the distances between all observations of the two sets.
  • distance_threshold (float, optional (default=None)) – The linkage distance threshold above which, clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.
Returns:

UnsupervisedModelAnalysis object to view results and further analysis

Return type:

UnsupervisedModelAnalysis

Examples

>>> model.AgglomerativeClustering()
>>> model.AgglomerativeClustering(model_name='ag_1', n_clusters=5)
>>> model.AgglomerativeClustering(run=False) # Add model to the queue
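
Since distance_threshold requires n_clusters=None and compute_full_tree=True (see the parameter notes above), a threshold-driven call would look roughly like the following sketch; the threshold value is purely illustrative.

>>> model.AgglomerativeClustering(n_clusters=None, compute_full_tree=True, distance_threshold=1.5)
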
DBScan(model_name='dbs', run=True, verbose=1, **kwargs)

Based on a set of points (for example, in a two-dimensional space), DBSCAN groups together points that are close to each other based on a distance measurement (usually Euclidean distance) and a minimum number of points. It also marks as outliers the points that are in low-density regions.

The DBSCAN algorithm should be used to find associations and structures in data that are hard to find manually but that can be relevant and useful to find patterns and predict trends.

For a list of all possible options for DBSCAN please visit: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

Parameters:
  • model_name (str, optional) – Name for this model, by default “dbs”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • eps (float) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
  • min_samples (int, optional) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
  • metric (string, or callable) – The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.
  • p (float, optional) – The power of the Minkowski metric to be used to calculate distance between points.
  • n_jobs (int or None, optional (default=None)) – The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
Returns:

UnsupervisedModelAnalysis object to view results and further analysis

Return type:

UnsupervisedModelAnalysis

Examples

>>> model.DBScan()
>>> model.DBScan(model_name='dbs_1', min_samples=5)
>>> model.DBScan(run=False) # Add model to the queue
GaussianMixtureClustering(model_name='gm_cluster', run=True, verbose=1, **kwargs)

Trains a GaussianMixture algorithm that implements the expectation-maximization algorithm for fitting mixture of Gaussian models.

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.

There are 2 key advantages to using GMMs.

Firstly GMMs are a lot more flexible in terms of cluster covariance than K-Means; due to the standard deviation parameter, the clusters can take on any ellipse shape, rather than being restricted to circles.

K-Means is actually a special case of GMM in which each cluster’s covariance along all dimensions approaches 0. Secondly, since GMMs use probabilities, they can have multiple clusters per data point.

So if a data point is in the middle of two overlapping clusters, we can simply define its class by saying it belongs X-percent to class 1 and Y-percent to class 2. I.e GMMs support mixed membership.

For more information on Gaussian Mixture algorithms please visit: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture

Parameters:
  • model_name (str, optional) – Name for this model, by default “gm_cluster”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • n_components (int, defaults to 1.) – The number of mixture components/ number of unique y_train values.
  • covariance_type ({‘full’ (default), ‘tied’, ‘diag’, ‘spherical’}) –

    String describing the type of covariance parameters to use. Must be one of:

    ‘full’
    each component has its own general covariance matrix
    ‘tied’
    all components share the same general covariance matrix
    ‘diag’
    each component has its own diagonal covariance matrix
    ‘spherical’
    each component has its own single variance
  • tol (float, defaults to 1e-3.) – The convergence threshold. EM iterations will stop when the lower bound average gain is below this threshold.
  • reg_covar (float, defaults to 1e-6.) – Non-negative regularization added to the diagonal of covariance. Allows to assure that the covariance matrices are all positive.
  • max_iter (int, defaults to 100.) – The number of EM iterations to perform.
  • n_init (int, defaults to 1.) – The number of initializations to perform. The best results are kept.
  • init_params ({‘kmeans’, ‘random’}, defaults to ‘kmeans’.) –

    The method used to initialize the weights, the means and the precisions. Must be one of:

    ’kmeans’ : responsibilities are initialized using kmeans. ‘random’ : responsibilities are initialized randomly.

  • weights_init (array-like, shape (n_components, ), optional) – The user-provided initial weights. If None, weights are initialized using the init_params method. Defaults to None.
  • means_init (array-like, shape (n_components, n_features), optional) – The user-provided initial means. If None, means are initialized using the init_params method. Defaults to None.
  • precisions_init (array-like, optional) –

    The user-provided initial precisions (inverse of the covariance matrices), defaults to None. If None, precisions are initialized using the ‘init_params’ method. The shape depends on ‘covariance_type’:

    (n_components,) if ‘spherical’, (n_features, n_features) if ‘tied’, (n_components, n_features) if ‘diag’, (n_components, n_features, n_features) if ‘full’

Returns:

UnsupervisedModelAnalysis object to view results and further analysis

Return type:

UnsupervisedModelAnalysis

Examples

>>> model.GaussianMixtureClustering()
>>> model.GaussianMixtureClustering(model_name='gm_1', max_iter=1000)
>>> model.GaussianMixtureClustering(run=False) # Add model to the queue
IsolationForest(model_name='iso_forest', run=True, verbose=1, **kwargs)

Isolation Forest Algorithm

Return the anomaly score of each sample using the IsolationForest algorithm

The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

For more Isolation Forest info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest

Parameters:
  • model_name (str, optional) – Name for this model, by default “iso_forest”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • n_estimators (int, optional (default=100)) – The number of base estimators in the ensemble.
  • max_samples (int or float, optional (default=”auto”)) –

    The number of samples to draw from X to train each base estimator.

    If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples. If “auto”, then max_samples=min(256, n_samples).

    If max_samples is larger than the number of samples provided, all samples will be used for all trees (no sampling).

  • contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function. If ‘auto’, the decision function threshold is determined as in the original paper.
  • max_features (int or float, optional (default=1.0)) –

    The number of features to draw from X to train each base estimator.

    If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.
  • bootstrap (boolean, optional (default=False)) – If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.
Returns:

UnsupervisedModelAnalysis object to view results and analyze results

Return type:

UnsupervisedModelAnalysis

Examples

>>> model.IsolationForest()
>>> model.IsolationForest(model_name='iso_1', max_features=5)
>>> model.IsolationForest(run=False) # Add model to the queue
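
Based on the contamination parameter described above, flagging roughly 5% of the training data as outliers would look like the following; the value is purely illustrative.

>>> model.IsolationForest(contamination=0.05)
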
KMeans(model_name='km', run=True, verbose=1, **kwargs)

NOTE: If ‘n_clusters’ is not provided, k will automatically be determined from an elbow plot, using distortion as the metric to find the optimal number of clusters.

K-means clustering is one of the simplest and popular unsupervised machine learning algorithms.

The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.

In other words, the K-means algorithm identifies k number of centroids, and then allocates every data point to the nearest cluster, while keeping the centroids as small as possible.

For a list of all possible options for K Means clustering please visit: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Parameters:
  • model_name (str, optional) – Name for this model, by default “km”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • n_clusters (int, optional, default: 8) – The number of clusters to form as well as the number of centroids to generate.
  • init ({‘k-means++’, ‘random’ or an ndarray}) –
    Method for initialization, defaults to ‘k-means++’:
    ‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

    ‘random’: choose k observations (rows) at random from data for the initial centroids.

    If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

  • n_init (int, default: 10) – Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
  • max_iter (int, default: 300) – Maximum number of iterations of the k-means algorithm for a single run.
  • random_state (int, RandomState instance or None (default)) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary.
  • algorithm (“auto”, “full” or “elkan”, default=”auto”) – K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.
Returns:

UnsupervisedModelAnalysis object to view results and further analysis

Return type:

UnsupervisedModelAnalysis

Examples

>>> model.KMeans()
>>> model.KMeans(model_name='kmean_1', n_clusters=5)
>>> model.KMeans(run=False) # Add model to the queue
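
The automatic choice of k mentioned in the note above is based on an elbow plot of distortion. A minimal standalone sketch of that idea using scikit-learn directly (not aethos internals) could look like this; X is assumed to be your numeric training data as a NumPy array.

>>> from scipy.spatial.distance import cdist
>>> from sklearn.cluster import KMeans
>>> distortions = []
>>> for k in range(1, 11):
...     km = KMeans(n_clusters=k, random_state=42).fit(X)
...     # distortion: mean distance of each point to its closest cluster centre
...     distortions.append(cdist(X, km.cluster_centers_).min(axis=1).mean())
>>> # choose k at the "elbow", where adding more clusters stops reducing distortion much
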
MeanShift(model_name='mshift', run=True, **kwargs)

Trains a Mean Shift clustering algorithm.

Mean shift clustering aims to discover “blobs” in a smooth density of samples.

It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region.

These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.

For more info on Mean Shift clustering please visit: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html#sklearn.cluster.MeanShift

Parameters:
  • model_name (str, optional) – Name for this model, by default “mshift”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • bandwidth (float, optional) –

    Bandwidth used in the RBF kernel.

    If not given, the bandwidth is estimated using sklearn.cluster.estimate_bandwidth; see the documentation for that function for hints on scalability (see also the Notes, below).

  • seeds (array, shape=[n_samples, n_features], optional) – Seeds used to initialize kernels. If not set, the seeds are calculated by clustering.get_bin_seeds with bandwidth as the grid size and default values for other parameters.
  • bin_seeding (boolean, optional) – If true, initial kernel locations are not locations of all points, but rather the location of the discretized version of points, where points are binned onto a grid whose coarseness corresponds to the bandwidth. Setting this option to True will speed up the algorithm because fewer seeds will be initialized. default value: False Ignored if seeds argument is not None.
  • min_bin_freq (int, optional) – To speed up the algorithm, accept only those bins with at least min_bin_freq points as seeds. If not defined, set to 1.
  • cluster_all (boolean, default True) – If true, then all points are clustered, even those orphans that are not within any kernel. Orphans are assigned to the nearest kernel. If false, then orphans are given cluster label -1.
Returns:

UnsupervisedModelAnalysis object to view results and further analysis

Return type:

UnsupervisedModelAnalysis

Examples

>>> model.MeanShift()
>>> model.MeanShift(model_name='ms_1', cluster_all=False)
>>> model.MeanShift(run=False) # Add model to the queue
OneClassSVM(model_name='ocsvm', run=True, verbose=1, **kwargs)

Trains a One Class SVM model.

Unsupervised Outlier Detection.

For more Support Vector info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM

Parameters:
  • model_name (str, optional) – Name for this model, by default “ocsvm”
  • run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
  • verbose (int, optional) – Verbosity level of model output, the higher the number - the more verbose. By default, 1
  • kernel (string, optional (default=’rbf’)) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix.
  • degree (int, optional (default=3)) – Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
  • gamma (float, optional (default=’auto’)) – Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. Current default is ‘auto’ which uses 1 / n_features, if gamma=’scale’ is passed then it uses 1 / (n_features * X.var()) as value of gamma.
  • coef0 (float, optional (default=0.0)) – Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
  • tol (float, optional) – Tolerance for stopping criterion.
  • nu (float, optional) – An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken.
  • shrinking (boolean, optional) – Whether to use the shrinking heuristic.
  • cache_size (float, optional) – Specify the size of the kernel cache (in MB).
  • max_iter (int, optional (default=-1)) – Hard limit on iterations within solver, or -1 for no limit.
Returns:

UnsupervisedModelAnalysis object to view results and analyze results

Return type:

UnsupervisedModelAnalysis

Examples

>>> model.OneClassSVM()
>>> model.OneClassSVM(model_name='ocs_1', max_iter=100)
>>> model.OneClassSVM(run=False) # Add model to the queue

Model Analysis Module

class aethos.model_analysis.model_analysis.ModelAnalysisBase

Bases: aethos.visualizations.visualizations.Visualizations, aethos.stats.stats.Stats

test_results
to_pickle()

Writes model to a pickle file.

Examples

>>> m = Model(df)
>>> m_results = m.LogisticRegression()
>>> m_results.to_pickle()
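
To reuse the saved model outside aethos, a standard pickle load works; the file path below is purely illustrative and depends on where to_pickle() wrote the file.

>>> import pickle
>>> with open('log_reg.pkl', 'rb') as f:  # hypothetical path to the written pickle
...     restored_model = pickle.load(f)
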
to_service(project_name: str)

Creates an app.py, requirements.txt and Dockerfile in ~/.aethos/projects and the necessary folder structure to run the model as a microservice.

Parameters:project_name (str) – Name of the project that you want to create.

Examples

>>> m = Model(df)
>>> m_results = m.LogisticRegression()
>>> m_results.to_service('your_proj_name')
train_results
class aethos.model_analysis.model_analysis.SupervisedModelAnalysis(model, x_train, x_test, y_train, y_test, model_name)

Bases: aethos.model_analysis.model_analysis.ModelAnalysisBase

decision_plot(num_samples=0.6, sample_no=None, highlight_misclassified=False, output_file='', **decisionplot_kwargs)

Visualize model decisions using cumulative SHAP values.

Each colored line in the plot represents the model prediction for a single observation.

Note that plotting too many samples at once can make the plot unintelligible.

When is a decision plot useful:
  • Show a large number of feature effects clearly.
  • Visualize multioutput predictions.
  • Display the cumulative effect of interactions.
  • Explore feature effects for a range of feature values.
  • Identify outliers.
  • Identify typical prediction paths.
  • Compare and contrast predictions for several models.
Explanation:
  • The plot is centered on the x-axis at the model’s expected value.
  • All SHAP values are relative to the model’s expected value like a linear model’s effects are relative to the intercept.
  • The y-axis lists the model’s features. By default, the features are ordered by descending importance.
  • The importance is calculated over the observations plotted. This is usually different than the importance ordering for the entire dataset. In addition to feature importance ordering, the decision plot also supports hierarchical cluster feature ordering and user-defined feature ordering.
  • Each observation’s prediction is represented by a colored line.
  • At the top of the plot, each line strikes the x-axis at its corresponding observation’s predicted value. This value determines the color of the line on a spectrum.
  • Moving from the bottom of the plot to the top, SHAP values for each feature are added to the model’s base value. This shows how each feature contributes to the overall prediction.
  • At the bottom of the plot, the observations converge at the model’s expected value.
Parameters:
  • output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.
  • num_samples (int, float, or 'all', optional) – Number of samples to display, if less than 1 it will treat it as a percentage, ‘all’ will include all samples , by default 0.6
  • sample_no (int, optional) – Sample number to isolate and analyze, if provided it overrides num_samples, by default None
  • highlight_misclassified (bool, optional) – True to highlight the misclassified results, by default False
  • feature_order (str or None or list or numpy.ndarray) – Any of “importance” (the default), “hclust” (hierarchical clustering), “none”, or a list/array of indices. hclust is useful for finding outliers.
  • feature_display_range (slice or range) – The slice or range of features to plot after ordering features by feature_order. A step of 1 or None will display the features in ascending order. A step of -1 will display the features in descending order. If feature_display_range=None, slice(-1, -21, -1) is used (i.e. show the last 20 features in descending order). If shap_values contains interaction values, the number of features is automatically expanded to include all possible interactions: N(N + 1)/2 where N = shap_values.shape[1].
  • highlight (Any) – Specify which observations to draw in a different line style. All numpy indexing methods are supported. For example, list of integer indices, or a bool array.
  • link (str) – Use “identity” or “logit” to specify the transformation used for the x-axis. The “logit” link transforms log-odds into probabilities.
  • plot_color (str or matplotlib.colors.ColorMap) – Color spectrum used to draw the plot lines. If str, a registered matplotlib color name is assumed.
  • axis_color (str or int) – Color used to draw plot axes.
  • y_demarc_color (str or int) – Color used to draw feature demarcation lines on the y-axis.
  • alpha (float) – Alpha blending value in [0, 1] used to draw plot lines.
  • color_bar (bool) – Whether to draw the color bar.
  • auto_size_plot (bool) – Whether to automatically size the matplotlib plot to fit the number of features displayed. If False, specify the plot size using matplotlib before calling this function.
  • title (str) – Title of the plot.
  • xlim (tuple[float, float]) – The extents of the x-axis (e.g. (-1.0, 1.0)). If not specified, the limits are determined by the maximum/minimum predictions centered around base_value when link=’identity’. When link=’logit’, the x-axis extents are (0, 1) centered at 0.5. x_lim values are not transformed by the link function. This argument is provided to simplify producing multiple plots on the same scale for comparison.
  • show (bool) – Whether to automatically display the plot.
  • return_objects (bool) – Whether to return a DecisionPlotResult object containing various plotting features. This can be used to generate multiple decision plots using the same feature ordering and scale, by default True.
  • ignore_warnings (bool) – Plotting many data points or too many features at a time may be slow, or may create very large plots. Set this argument to True to override hard-coded limits that prevent plotting large amounts of data.
  • new_base_value (float) – SHAP values are relative to a base value; by default, the expected value of the model’s raw predictions. Use new_base_value to shift the base value to an arbitrary value (e.g. the cutoff point for a binary classification task).
  • legend_labels (list of str) – List of legend labels. If None, legend will not be shown.
  • legend_location (str) – Legend location. Any of “best”, “upper right”, “upper left”, “lower left”, “lower right”, “right”, “center left”, “center right”, “lower center”, “upper center”, “center”.
Returns:

The DecisionPlotResult object if return_objects=True (the default); None otherwise.

Return type:

DecisionPlotResult

Examples

>>> # Plot two decision plots using the same feature order and x-axis.
>>> m = model.LogisticRegression()
>>> r = m.decision_plot()
>>> m.decision_plot(sample_no=42, feature_order=r.feature_idx, xlim=r.xlim)
dependence_plot(feature: str, interaction='auto', output_file='', **dependenceplot_kwargs)

A dependence plot is a scatter plot that shows the effect a single feature has on the predictions made by the model.

Explanation:
  • Each dot is a single prediction (row) from the dataset.
  • The x-axis is the value of the feature (from the X matrix).
  • The y-axis is the SHAP value for that feature, which represents how much knowing that feature’s value changes the output of the model for that sample’s prediction.
  • The color corresponds to a second feature that may have an interaction effect with the feature we are plotting (by default this second feature is chosen automatically).
  • If an interaction effect is present between this other feature and the feature we are plotting it will show up as a distinct vertical pattern of coloring.
Parameters:
  • feature (str) – Feature whose impact on the model you want to analyze
  • interaction ("auto", None, int, or string) – The index of the feature used to color the plot. The name of a feature can also be passed as a string. If “auto” then shap.common.approximate_interactions is used to pick what seems to be the strongest interaction (note that to find the true strongest interaction you need to compute the SHAP interaction values).
  • output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.
  • x_jitter (float (0 - 1)) – Adds random jitter to feature values. May increase plot readability when feature is discrete.
  • alpha (float) – The transparency of the data points (between 0 and 1). This can be useful to the show density of the data points when using a large dataset.
  • xmin (float or string) – Represents the lower bound of the plot’s x-axis. It can be a string of the format “percentile(float)” to denote that percentile of the feature’s value used on the x-axis.
  • xmax (float or string) – Represents the upper bound of the plot’s x-axis. It can be a string of the format “percentile(float)” to denote that percentile of the feature’s value used on the x-axis.
  • ax (matplotlib Axes object) – Optionally specify an existing matplotlib Axes object, into which the plot will be placed. In this case we do not create a Figure, otherwise we do.
  • cmap (str or matplotlib.colors.ColorMap) – Color spectrum used to draw the plot lines. If str, a registered matplotlib color name is assumed.

Examples

>>> m = model.LogisticRegression()
>>> m.dependence_plot('col1') # substitute the name of the feature to analyze
force_plot(sample_no=None, misclassified=False, output_file='', **forceplot_kwargs)

Visualize the given SHAP values with an additive force layout

Parameters:
  • sample_no (int, optional) – Sample number to isolate and analyze, by default None
  • misclassified (bool, optional) – True to only show the misclassified results, by default False
  • output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.
  • link ("identity" or "logit") – The transformation used when drawing the tick mark labels. Using logit will change log-odds numbers into probabilities.
  • matplotlib (bool) – Whether to use the default Javascript output, or the (less developed) matplotlib output. Using matplotlib can be helpful in scenarios where rendering Javascript/HTML is inconvenient.

Examples

>>> m = model.LogisticRegression()
>>> m.force_plot() # The entire test dataset
>>> m.force_plot(sample_no=1, misclassified=True) # Analyze the first misclassified result
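A further illustrative sketch using the pass-through SHAP kwargs described above:

>>> m.force_plot(link='logit')  # draw tick-mark labels as probabilities instead of log-odds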
interpret_model(show=True)

Displays a dashboard interpreting your model’s performance, behaviour and individual predictions.

If you have already run other interpret functions, their results will be included in the dashboard; otherwise all the other interpretation methods will be run and included in the dashboard.

Examples

>>> m = model.LogisticRegression()
>>> m.interpret_model()
interpret_model_behavior(method='all', predictions='default', show=True, **interpret_kwargs)

Provides an interpretable summary of your model’s behaviour based on an explainer.

The method can either be ‘morris’ (Morris sensitivity analysis) or ‘dependence’ (partial dependence).

If ‘all’, a dashboard is displayed with both the Morris and partial dependence analyses.

Parameters:
  • method (str, optional) – Explainer type, can either be ‘all’, ‘morris’ or ‘dependence’, by default ‘all’
  • predictions (str, optional) – Prediction type, can either be ‘default’ (.predict) or ‘probability’ if the model can predict probabilities, by default ‘default’
  • show (bool, optional) – False to not display the plot, by default True

Examples

>>> m = model.LogisticRegression()
>>> m.interpret_model_behavior()
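A further illustrative sketch selecting a single explainer and probability predictions (only applicable if the model can predict probabilities):

>>> m.interpret_model_behavior(method='morris', predictions='probability')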
interpret_model_performance(method='all', predictions='default', show=True, **interpret_kwargs)

Plots an interpretable display of your model based on a performance metric.

Can either be ‘ROC’ or ‘PR’ (precision-recall) for classification problems.

Can be ‘regperf’ for regression problems.

If ‘all’, a dashboard is displayed with the corresponding explainers for the problem type.

ROC: Receiver Operating Characteristic
PR: Precision Recall
regperf: RegressionPerf

Parameters:
  • method (str) – Performance metric, either ‘all’, ‘ROC’ or ‘PR’, by default ‘all’
  • predictions (str, optional) – Prediction type, can either be ‘default’ (.predict) or ‘probability’ if the model can predict probabilities, by default ‘default’
  • show (bool, optional) – False to not display the plot, by default True

Examples

>>> m = model.LogisticRegression()
>>> m.interpret_model_performance()
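A further illustrative sketch selecting a single performance explainer (the casing of the method string follows the parameter description above):

>>> m.interpret_model_performance(method='ROC')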
interpret_model_predictions(num_samples=0.25, sample_no=None, method='all', predictions='default', show=True, **interpret_kwargs)

Plots an interpretable display that explains individual predictions of your model.

Supported explainers are either ‘lime’ or ‘shap’.

If ‘all’, a dashboard is displayed with both the LIME and SHAP explanations.

Parameters:
  • num_samples (int, float, or 'all', optional) – Number of samples to display; if less than 1 it is treated as a percentage of samples, ‘all’ will include all samples, by default 0.25
  • sample_no (int, optional) – Sample number to isolate and analyze, if provided it overrides num_samples, by default None
  • method (str, optional) – Explainer type, can either be ‘all’, ‘lime’, or ‘shap’, by default ‘all’
  • predictions (str, optional) – Prediction type, can either be ‘default’ (.predict) or ‘probability’ if the model can predict probabilities, by default ‘default’
  • show (bool, optional) – False to not display the plot, by default True

Examples

>>> m = model.LogisticRegression()
>>> m.interpret_model_predictions()
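A further illustrative sketch isolating a single sample with a single explainer; the sample number is a placeholder:

>>> m.interpret_model_predictions(sample_no=1, method='shap')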
model_weights()

Prints and logs all the features ranked by importance from most to least important.

Returns:Dictionary of features and their corresponding weights
Return type:dict
Raises:AttributeError – If model does not have coefficients to display

Examples

>>> m = model.LogisticRegression()
>>> m.model_weights()
shap_get_misclassified_index()

Prints the sample numbers of misclassified samples.

Examples

>>> m = model.LogisticRegression()
>>> m.shap_get_misclassified_index()
summary_plot(output_file='', **summaryplot_kwargs)

Create a SHAP summary plot, colored by feature values when they are provided.

For a list of all kwargs please see the Shap documentation : https://shap.readthedocs.io/en/latest/#plots

Parameters:
  • output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.
  • max_display (int) – How many top features to include in the plot (default is 20, or 7 for interaction plots), by default None
  • plot_type ("dot" (default for single output), "bar" (default for multi-output), "violin", or "compact_dot") – What type of summary plot to produce. Note that “compact_dot” is only used for SHAP interaction values.
  • color (str or matplotlib.colors.ColorMap) – Color spectrum used to draw the plot lines. If str, a registered matplotlib color name is assumed.
  • axis_color (str or int) – Color used to draw plot axes.
  • title (str) – Title of the plot.
  • alpha (float) – Alpha blending value in [0, 1] used to draw plot lines.
  • show (bool) – Whether to automatically display the plot.
  • sort (bool) – Whether to sort features by importance, by default True
  • color_bar (bool) – Whether to draw the color bar.
  • auto_size_plot (bool) – Whether to automatically size the matplotlib plot to fit the number of features displayed. If False, specify the plot size using matplotlib before calling this function.
  • layered_violin_max_num_bins (int) – Max number of bins, by default 20
  • **summaryplot_kwargs – For more info see https://shap.readthedocs.io/en/latest/#plots

Examples

>>> m = model.LogisticRegression()
>>> m.summary_plot()
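A further illustrative sketch passing through the SHAP summary-plot kwargs described above; the output file name is a placeholder:

>>> m.summary_plot(plot_type='bar', max_display=10, output_file='shap_summary.png')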
view_tree(tree_num=0, output_file=None, **kwargs)

Plot decision trees.

Parameters:
  • tree_num (int, optional) – For ensemble, boosting, and stacking methods - the tree number to plot, by default 0
  • output_file (str, optional) – Name of the file including extension, by default None

Examples

>>> m = model.DecisionTreeClassifier()
>>> m.view_tree()
>>> m = model.XGBoostClassifier()
>>> m.view_tree(2)
class aethos.model_analysis.classification_model_analysis.ClassificationModelAnalysis(model, x_train, x_test, target, model_name)

Bases: aethos.model_analysis.model_analysis.SupervisedModelAnalysis

accuracy(**kwargs)

It measures how many observations, both positive and negative, were correctly classified.

Returns:Accuracy
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.accuracy()
average_precision(**kwargs)

AP summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.

Returns:Average Precision Score
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.average_precision()
balanced_accuracy(**kwargs)

The balanced accuracy is used in binary and multiclass classification problems to deal with imbalanced datasets. It is defined as the average of recall obtained on each class.

The best value is 1 and the worst value is 0 when adjusted=False.

Returns:Balanced accuracy
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.balanced_accuracy()
brier_loss(**kwargs)

Compute the Brier score. The smaller the Brier score, the better, hence the naming with “loss”. Across all items in a set N predictions, the Brier score measures the mean squared difference between (1) the predicted probability assigned to the possible outcomes for item i, and (2) the actual outcome. Therefore, the lower the Brier score is for a set of predictions, the better the predictions are calibrated.

The Brier score is appropriate for binary and categorical outcomes that can be structured as true or false, but is inappropriate for ordinal variables which can take on three or more values (this is because the Brier score assumes that all possible outcomes are equivalently “distant” from one another).

Returns:Brier loss
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.brier_loss()
classification_report()

Prints and logs the classification report.

The classification report displays and logs the information in this format:

              precision    recall  f1-score   support

           1       1.00      0.67      0.80         3
           2       0.00      0.00      0.00         0
           3       0.00      0.00      0.00         0

   micro avg       1.00      0.67      0.80         3
   macro avg       0.33      0.22      0.27         3
weighted avg       1.00      0.67      0.80         3

Examples

>>> m = model.LogisticRegression()
>>> m.classification_report()
cohen_kappa(**kwargs)

Cohen Kappa tells you how much better your model is than a random classifier that predicts based on class frequencies.

This measure is intended to compare labelings by different human annotators, not a classifier versus a ground truth.

The kappa score is a number between -1 and 1. Scores above 0.8 are generally considered good agreement; zero or lower means no agreement (practically random labels).

Returns:Cohen Kappa score.
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.cohen_kappa()
confusion_matrix(title=None, normalize=False, hide_counts=False, x_tick_rotation=0, figsize=None, cmap='Blues', title_fontsize='large', text_fontsize='medium', output_file='')

Prints a confusion matrix as a heatmap.

Parameters:
  • title (str) – The text to display at the top of the matrix, by default ‘Confusion Matrix’
  • normalize (bool) – If False, plot the raw numbers; if True, plot the proportions, by default False
  • hide_counts (bool) – If False, display the counts and percentages; if True, hide them, by default False
  • x_tick_rotation (int) – Degree of rotation for the x-axis tick labels, by default 0
  • figsize (tuple(int, int)) – Size of the figure, by default None
  • cmap (str) – The gradient of the values displayed, from matplotlib.pyplot.cm (see http://matplotlib.org/examples/color/colormaps_reference.html), e.g. plt.get_cmap(‘jet’) or plt.cm.Blues, by default ‘Blues’
  • title_fontsize (str) – Size of the title, by default ‘large’
  • text_fontsize (str) – Size of the text of the rest of the plot, by default ‘medium’
  • output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.

Examples

>>> m = model.LogisticRegression()
>>> m.confusion_matrix()
>>> m.confusion_matrix(normalize=True)
cross_validate(cv_type='strat-kfold', score='accuracy', n_splits=5, shuffle=False, **kwargs)

Runs cross validation on a Classification model.

Scoring Metrics:
  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘average_precision’
  • ‘brier_score_loss’
  • ‘f1’
  • ‘f1_micro’
  • ‘f1_macro’
  • ‘f1_weighted’
  • ‘f1_samples’
  • ‘neg_log_loss’
  • ‘precision’
  • ‘recall’
  • ‘jaccard’
  • ‘roc_auc’
  • ‘roc_auc_ovr’
  • ‘roc_auc_ovo’
  • ‘roc_auc_ovr_weighted’
  • ‘roc_auc_ovo_weighted’
Parameters:
  • cv_type ({kfold, strat-kfold}, optional) – Cross-validation type, by default “strat-kfold”
  • score (str, optional) – Scoring metric, by default “accuracy”
  • n_splits (int, optional) – Number of times to split the data, by default 5
  • shuffle (bool, optional) – True to shuffle the data, by default False
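Examples

A minimal usage sketch; the argument values are illustrative choices from the scoring metrics listed above, not defaults:

>>> m = model.LogisticRegression()
>>> m.cross_validate(cv_type='strat-kfold', score='f1', n_splits=10, shuffle=True)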
decision_boundary(x=None, y=None, title='Decision Boundary')

Plots a decision boundary for a given model.

If no x or y columns are provided, it defaults to the first 2 columns of your data.

Parameters:
  • x (str, optional) – Column in the dataframe to plot, Feature one, by default None
  • y (str, optional) – Column in the dataframe to plot, Feature two, by default None
  • title (str, optional) – Title of the decision boundary plot, by default “Decision Boundary”
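Examples

A minimal usage sketch; ‘col1’ and ‘col2’ are placeholder column names from your dataset:

>>> m = model.LogisticRegression()
>>> m.decision_boundary()
>>> m.decision_boundary(x='col1', y='col2')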
f1(**kwargs)

The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter.

Returns:F1 Score
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.f1()
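A quick hand-check of the formula above, assuming an illustrative precision of 0.5 and recall of 1.0:

>>> 2 * (0.5 * 1.0) / (0.5 + 1.0)
0.6666666666666666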
fbeta(beta=0.5, **kwargs)

The F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. The beta parameter determines the weight of recall in the combined score. Beta < 1 lends more weight to precision, while beta > 1 favors recall (beta -> 0 considers only precision, beta -> inf only recall).

Parameters:beta (float, optional) – Weight of precision in harmonic mean, by default 0.5
Returns:Fbeta score
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.fbeta()
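A further illustrative sketch; a beta greater than 1 weights recall more heavily than precision:

>>> m.fbeta(beta=2)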
hamming_loss(**kwargs)

The Hamming loss is the fraction of labels that are incorrectly predicted.

Returns:Hamming loss
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.hamming_loss()
hinge_loss(**kwargs)

Computes the average distance between the model and the data using hinge loss, a one-sided metric that considers only prediction errors.

Returns:Hinge loss
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.hinge_loss()
jaccard(**kwargs)

The Jaccard index, or Jaccard similarity coefficient, is defined as the size of the intersection divided by the size of the union of two label sets. It is used to compare the set of predicted labels for a sample to the corresponding set of labels in y_true.

Returns:Jaccard Score
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.jaccard()
log_loss(**kwargs)

Log loss, aka logistic loss or cross-entropy loss.

This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions.

Returns:Log loss
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.log_loss()
matthews_corr_coef(**kwargs)

The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient.

Returns:Matthews Correlation Coefficient
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.matthews_corr_coef()
metrics(*metrics)

Measures how well your model performed against certain metrics.

For multiclass classification problems, the ‘macro’ average is used.

If project metrics have been specified, those metrics will be displayed; otherwise the specified metrics, or all supported metrics, will be displayed.

For more detailed information and parameters please see the following link: https://scikit-learn.org/stable/modules/classes.html#classification-metrics

Supported metrics are:

‘Accuracy’: ‘Measures how many observations, both positive and negative, were correctly classified.’,

‘Balanced Accuracy’: ‘The balanced accuracy in binary and multiclass classification problems to deal with imbalanced datasets. It is defined as the average of recall obtained on each class.’,

‘Average Precision’: ‘Summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold’,

‘ROC AUC’: ‘Shows how good at ranking predictions your model is. It tells you what is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.’,

‘Zero One Loss’: ‘Fraction of misclassifications.’,

‘Precision’: ‘It measures how many observations predicted as positive are positive. Good to use when False Positives are costly.’,

‘Recall’: ‘It measures how many observations out of all positive observations have been classified as positive. Good to use when catching all positive occurrences, usually at the cost of false positives.’,

‘Matthews Correlation Coefficient’: ‘It’s a correlation between predicted classes and ground truth.’,

‘Log Loss’: ‘The difference between ground truth and predicted score for every observation, averaged over all observations.’,

‘Jaccard’: ‘Defined as the size of the intersection divided by the size of the union of two label sets, is used to compare set of predicted labels for a sample to the corresponding set of true labels.’,

‘Hinge Loss’: ‘Computes the average distance between the model and the data using hinge loss, a one-sided metric that considers only prediction errors.’,

‘Hamming Loss’: ‘The Hamming loss is the fraction of labels that are incorrectly predicted.’,

‘F-Beta’: ‘It’s the harmonic mean between precision and recall, with an emphasis on one or the other. Takes into account both metrics, good for imbalanced problems (spam, fraud, etc.).’,

‘F1’: ‘It’s the harmonic mean between precision and recall. Takes into account both metrics, good for imbalanced problems (spam, fraud, etc.).’,

‘Cohen Kappa’: ‘Cohen Kappa tells you how much better is your model over the random classifier that predicts based on class frequencies. Works well for imbalanced problems.’,

‘Brier Loss’: ‘It is a measure of how far your predictions lie from the true values. Basically, it is a mean square error in the probability space.’

Parameters:metrics (str(s), optional) – Specific type of metrics to view

Examples

>>> m = model.LogisticRegression()
>>> m.metrics()
>>> m.metrics('F1', 'F-Beta')
precision(**kwargs)

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives.

The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The best value is 1 and the worst value is 0.

Returns:Precision
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.precision()
recall(**kwargs)

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives.

The recall is intuitively the ability of the classifier to find all the positive samples.

The best value is 1 and the worst value is 0.

Returns:Recall
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.recall()
roc_auc(**kwargs)

This metric shows how good your model is at ranking predictions. It tells you the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.

Returns:ROC AUC Score
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.roc_auc()
roc_curve(title=True, output_file='')

Plots an ROC curve and displays the ROC statistics (area under the curve).

Parameters:
  • figsize (tuple(int, int), optional) – Figure size, by default (600,450)
  • title (bool) – Whether to display title, by default True
  • output_file (str, optional) – If a name is provided save the plot to an html file, by default ‘’

Examples

>>> m = model.LogisticRegression()
>>> m.roc_curve()
zero_one_loss(**kwargs)

Returns the fraction of misclassifications (float); if normalization is turned off, it returns the number of misclassifications (int).

The best performance is 0.

Returns:Zero one loss
Return type:float

Examples

>>> m = model.LogisticRegression()
>>> m.zero_one_loss()
class aethos.model_analysis.regression_model_analysis.RegressionModelAnalysis(model, x_train, x_test, target, model_name)

Bases: aethos.model_analysis.model_analysis.SupervisedModelAnalysis

cross_validate(cv_type='kfold', score='neg_root_mean_squared_error', n_splits=5, shuffle=False, **kwargs)

Runs cross validation on a Regression model.

Scoring Metrics:
  • ‘explained_variance’
  • ‘max_error’
  • ‘neg_mean_absolute_error’ –> MAE
  • ‘neg_mean_squared_error’ –> MSE
  • ‘neg_mean_squared_log_error’ –> MSLE
  • ‘neg_median_absolute_error’ –> MeAE
  • ‘r2’
  • ‘neg_mean_poisson_deviance’
  • ‘neg_mean_gamma_deviance’
Parameters:
  • cv_type ({kfold, strat-kfold}, optional) – Cross-validation type, by default “kfold”
  • score (str, optional) – Scoring metric, by default “neg_root_mean_squared_error”
  • n_splits (int, optional) – Number of times to split the data, by default 5
  • shuffle (bool, optional) – True to shuffle the data, by default False
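Examples

A minimal usage sketch; the scoring metric is an illustrative choice from the list above, not the default:

>>> m = model.LinearRegression()
>>> m.cross_validate(cv_type='kfold', score='neg_mean_absolute_error', n_splits=5)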
explained_variance(multioutput='uniform_average', **kwargs)

Explained variance regression score function

Best possible score is 1.0, lower values are worse.

Parameters:multioutput (string in [‘raw_values’, ‘uniform_average’, ‘variance_weighted’] or array-like of shape (n_outputs)) – Defines aggregating of multiple output scores. Array-like value defines weights used to average scores, by default ‘uniform_average’.
  • ‘raw_values’ : Returns a full set of scores in case of multioutput input.
  • ‘uniform_average’ : Scores of all outputs are averaged with uniform weight.
  • ‘variance_weighted’ : Scores of all outputs are averaged, weighted by the variances of each individual output.

Returns:Explained Variance
Return type:float

Examples

>>> m = model.LinearRegression()
>>> m.explained_variance()
max_error()

Returns the single largest (maximum) residual error.

Returns:Max error
Return type:float

Examples

>>> m = model.LinearRegression()
>>> m.max_error()
mean_abs_error(**kwargs)

Mean absolute error.

Returns:Mean absolute error.
Return type:float

Examples

>>> m = model.LinearRegression()
>>> m.mean_abs_error()
mean_sq_error(**kwargs)

Mean squared error.

Returns:Mean squared error.
Return type:float

Examples

>>> m = model.LinearRegression()
>>> m.mean_sq_error()
mean_sq_log_error(**kwargs)

Mean squared log error.

Returns:Mean squared log error.
Return type:float

Examples

>>> m = model.LinearRegression()
>>> m.mean_sq_log_error()
median_abs_error(**kwargs)

Median absolute error.

Returns:Median absolute error.
Return type:float

Examples

>>> m = model.LinearRegression()
>>> m.median_abs_error()
metrics(*metrics)

Measures how well your model performed against certain metrics.

If project metrics have been specified, those metrics will be displayed; otherwise the specified metrics, or all supported metrics, will be displayed.

For more detailed information and parameters please see the following link: https://scikit-learn.org/stable/modules/classes.html#regression-metrics

Supported metrics are:

‘Explained Variance’: ‘Explained variance regression score function. Best possible score is 1.0, lower values are worse.’,

‘Max Error’: ‘Returns the single largest (maximum) residual error.’,

‘Mean Absolute Error’: ‘Positive mean value of all residuals’,

‘Mean Squared Error’: ‘Mean of the squared residuals’,

‘Root Mean Squared Error’: ‘Square root of the Mean Squared Error’,

‘Mean Squared Log Error’: ‘Mean of the squared logarithmic errors’,

‘Median Absolute Error’: ‘Positive median value of all residuals’,

‘R2’: ‘R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that is explained by an independent variable or variables in a regression model.’,

‘SMAPE’: ‘Symmetric mean absolute percentage error. It is an accuracy measure based on percentage (or relative) errors.’

Parameters:metrics (str(s), optional) – Specific type of metrics to view

Examples

>>> m = model.LinearRegression()
>>> m.metrics()
>>> m.metrics('SMAPE', 'Root Mean Squared Error')
plot_predicted_actual(output_file='', **scatterplot_kwargs)

Plots the actual data vs. predictions

Parameters:output_file (str, optional) – Output file name, by default “”
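Examples

A minimal usage sketch; the output file name is a placeholder:

>>> m = model.LinearRegression()
>>> m.plot_predicted_actual()
>>> m.plot_predicted_actual(output_file='predicted_vs_actual.png')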
r2(**kwargs)

R^2 (coefficient of determination) regression score function.

R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that is explained by an independent variable or variables in a regression model.

Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an R^2 score of 0.0.

Returns:R2 coefficient.
Return type:float

Examples

>>> m = model.LinearRegression()
>>> m.r2()
root_mean_sq_error()

Root mean squared error.

Calculated by taking the square root of the Mean Squared Error.

Returns:Root mean squared error.
Return type:float

Examples

>>> m = model.LinearRegression()
>>> m.root_mean_sq_error()
smape(**kwargs)

Symmetric mean absolute percentage error.

It is an accuracy measure based on percentage (or relative) errors.

Returns:SMAPE
Return type:float

Examples

>>> m = model.LinearRegression()
>>> m.smape()
class aethos.model_analysis.unsupervised_model_analysis.UnsupervisedModelAnalysis(model, data, model_name)

Bases: aethos.model_analysis.model_analysis.ModelAnalysisBase

filter_cluster(cluster_no: int)

Filters data by a cluster number for analysis.

Parameters:cluster_no (int) – Cluster number to filter by
Returns:Filtered data or test dataframe
Return type:Dataframe

Examples

>>> m = model.KMeans()
>>> m.filter_cluster(1)
plot_clusters(dim=2, reduce='pca', output_file='', **kwargs)

Plots the clusters in either 2d or 3d space with each cluster point highlighted as a different colour.

Additional 2d and 3d plotting options can be passed through **kwargs.

Parameters:
  • dim (2 or 3, optional) – Dimension of the plot, either 2 for 2d, 3 for 3d, by default 2
  • reduce (str {'pca', 'tvsd', 'lle', 'tsne'}, optional) – Dimension reduction strategy i.e. pca, by default “pca”
  • output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.

Examples

>>> m = model.KMeans()
>>> m.plot_clusters()
>>> m.plot_clusters(dim=3)
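A further illustrative sketch selecting a different dimensionality-reduction strategy from the options listed above:

>>> m.plot_clusters(dim=2, reduce='tsne')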
class aethos.model_analysis.text_model_analysis.TextModelAnalysis(model, data, model_name, **kwargs)

Bases: aethos.model_analysis.model_analysis.ModelAnalysisBase

coherence_score(col_name)

Displays the coherence score of the topic model.

For more info on topic coherence: https://rare-technologies.com/what-is-topic-coherence/

Parameters:col_name (str) – Column name that was used as input for the LDA model

Examples

>>> m = model.LDA()
>>> m.coherence_score('col_name')
model_perplexity()

Displays the model perplexity of the topic model.

Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models.

A low perplexity indicates the probability distribution is good at predicting the sample.

Examples

>>> m = model.LDA()
>>> m.model_perplexity()
view(original_text, model_output)

View the original text and the model output in a more user-friendly format.

Parameters:
  • original_text (str) – Column name of the original text
  • model_output (str) – Column name of the model text

Examples

>>> m = model.LDA()
>>> m.view('original_text_col_name', 'model_output_col_name')
view_topic(topic_num: int, **kwargs)

View a specific topic from topic modelling model.

Parameters:topic_num (int) – Topic number to view
Returns:String representation of topic and probabilities
Return type:str

Examples

>>> m = model.LDA()
>>> m.view_topic(1)
view_topics(num_topics=10, **kwargs)

View topics from topic modelling model.

Parameters:num_topics (int, optional) – Number of topics to view, by default 10
Returns:String representation of topics and probabilities
Return type:str

Examples

>>> m = model.LDA()
>>> m.view_topics()
visualize_topics(**kwargs)

Visualize topics using pyLDAvis.

Parameters:
  • R (int) – The number of terms to display in the barcharts of the visualization. Default is 30. Recommended to be roughly between 10 and 50.
  • lambda_step (float, between 0 and 1) – Determines the interstep distance in the grid of lambda values over which to iterate when computing relevance. Default is 0.01. Recommended to be between 0.01 and 0.1.
  • mds (function or {'pcoa', 'mmds', 'tsne'}) – A function that takes topic_term_dists as an input and outputs a n_topics by 2 distance matrix. The output approximates the distance between topics. See js_PCoA() for details on the default function. A string representation currently accepts pcoa (or upper case variant), mmds (or upper case variant) and tsne (or upper case variant), if the sklearn package is installed for the latter two.
  • n_jobs (int) – The number of cores to be used to do the computations. The regular joblib conventions are followed so -1, which is the default, will use all cores.
  • plot_opts (dict, with keys ‘xlab’ and ‘ylab’) – Dictionary of plotting options, right now only used for the axis labels.
  • sort_topics (bool) – Sort topics by topic proportion (percentage of tokens covered). Set to False to keep the original topic order.

Examples

>>> m = model.LDA()
>>> m.visualize_topics()
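A further illustrative sketch passing through the pyLDAvis options described above:

>>> m.visualize_topics(R=20, sort_topics=False)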