“A collection of tools for Data Scientists and ML Engineers so they can focus less on how to do an analysis and more on which analytic techniques will yield the most insight from their data.”
Welcome to aethos’s documentation!¶
Aethos is a library/platform that automates your data science and analytical tasks at any stage in the pipeline. At its core, Aethos is a uniform API that helps automate analytical techniques from various libraries such as pandas, scikit-learn, gensim, etc.
Aethos provides:
 Automated data science cleaning, preprocessing, feature engineering and modelling techniques through one line of code
 Automated visualizations through one line of code
 Reusable code: no more copying code from notebook to notebook
 Automated dependency and corpus management
 Data science project templates
 Integrated 3rd party Jupyter plugins to make analyzing data more friendly
 Model analysis use cases: confusion matrix, ROC curve, all metrics, decision tree plots, etc.
 Model interpretability: local through SHAP and LIME, global through Morris Sensitivity
 Interactive checklists and tips to either remind or help you through your analysis
 Comparing train and test data distributions
 Exporting trained models as a service (generates the necessary code, files and folder structure)
 Experiment tracking with MLFlow
 Statistical tests: ANOVA, t-test, etc.
 Pretrained models: BERT, GPT-2, etc.
Plus more coming soon such as:
 Testing for model drift
 Recommendation models
 Parallelization through Dask and/or Spark
 Uniform API for deep learning models
 Automated code and file generation for Jupyter notebook development and a Python file of your data pipeline
Aethos makes it easy to PoC, experiment and compare different techniques and models from various libraries. From imputations, visualizations, scaling, dimensionality reduction and feature engineering to modelling, model results and model deployment: all done with a single, human-readable line of code!
Aethos utilizes other open source libraries to enhance your analysis, from enhanced statistical information and interactive visual plots to statistical tests and models: all your tools in one place, all accessible with one line of code.
For more info such as features, development plan, status and vision, check out the Aethos GitHub page.
Usage¶
Examples can be viewed here.
To start, we need to import the aethos dependencies as well as pandas.
Before that, we can create a full data science folder structure by running aethos create
from the command line and following the command prompts.
For a full list of methods please see the full docs or TECHNIQUES.md.
Options¶
To enable extensions, such as QGrid interactive filtering:
import aethos as at
at.options.interactive_df = True
Currently the following options are:
 interactive_df: Interactive grid with QGrid
 interactive_table: Interactive grid with Itable; comes with built-in client-side searching
 project_metrics: Sets project metrics; a project metric is a metric or set of metrics used to evaluate models
 track_experiments: Uses MLFlow to track models and experiments
User options such as changing the directory where images and projects are saved can be edited in the config file. This is located at USER_HOME/.aethos/ .
This location is also the default location of where any images and projects are stored.
*New in 2.0*
The Data and Model objects no longer exist; instead there are multiple objects, each with a clearer purpose.
Analysis: Used to analyze, visualize and run statistical models (t-tests, ANOVAs, etc.)
Classification: Used to analyze, visualize, run statistical models and train classification models.
Regression: Used to analyze, visualize, run statistical models and train regression models.
Unsupervised: Used to analyze, visualize, run statistical models and train unsupervised models.
ClassificationModelAnalysis: Used to analyze, interpret and visualize the results of a classification model.
RegressionModelAnalysis: Used to analyze, interpret and visualize the results of a regression model.
UnsupervisedModelAnalysis: Used to analyze, interpret and visualize the results of an unsupervised model.
TextModelAnalysis: Used to analyze, interpret and visualize the results of a text model.
Analysis¶
import aethos as at
import pandas as pd
x_train = pd.read_csv('data/train.csv') # load data into pandas
# Initialize the Classification object with training data
# By default, if no test data (x_test) is provided, then the data is split with 20% going to the test set
#
# Specify predictor field as 'Survived'
df = at.Classification(x_train, target='Survived')
df.x_train # View your training data
df.x_test # View your testing data
df # Glance at your training data
df[df.Age > 25] # Filter the data
df.x_train['new_col'] = [1,2] # Add a new column directly to the training data
df.x_test['new_col'] = [1,2] # Add a new column directly to the testing data
df.data_report(title='Titanic Summary', output_file='titanic_summary.html') # Automate EDA with pandas profiling with an autogenerated report
df.describe() # Display a high level view of your data using an extended version of pandas describe
df.describe_column('Fare') # Get in-depth statistics about the 'Fare' column
df.mean() # Run pandas functions on the aethos objects
df.missing_data # View your missing data at anytime
df.correlation_matrix() # Generate a correlation matrix for your training data
df.predictive_power() # Calculates the predictive power of each variable
df.autoviz() # Run AutoViz EDA on your data
df.pairplot() # Generate pairplots for your training data features at any time
df.checklist() # Provides an interactive checklist to keep track of your cleaning tasks
NOTE: One of the benefits of using aethos is that any method you apply to your train set also gets applied to your test set. For any method that requires fitting (e.g. replacing missing data with the mean), the method is fit on the training data and then applied to the testing data to avoid data leakage.
# Replace missing values in the 'Fare' and 'Embarked' column with the most common values in each of the respective columns.
df.replace_missing_mostcommon('Fare', 'Embarked')
# To create a "checkpoint" of your data (i.e. if you just want to test this analytical method), assign the result to a variable
df_checkpoint = df.replace_missing_mostcommon('Fare', 'Embarked')
# Replace missing values in the 'Age' column with a random value that follows the probability distribution of the 'Age' column in the training set.
df.replace_missing_random_discrete('Age')
df.drop('Cabin') # Drop the cabin column
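The fit-on-train, apply-to-test pattern behind replace_missing_mostcommon can be sketched in plain pandas (the toy data below is illustrative, not part of the aethos API):

```python
import pandas as pd

train = pd.DataFrame({'Embarked': ['S', 'S', None, 'C']})
test = pd.DataFrame({'Embarked': [None, 'Q']})

# Fit on the training data only: find the most common value there...
most_common = train['Embarked'].mode()[0]

# ...then apply that same value to both sets to avoid data leakage.
train['Embarked'] = train['Embarked'].fillna(most_common)
test['Embarked'] = test['Embarked'].fillna(most_common)
```

The test set never influences which value is imputed; it only receives the value learned from the training set.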
As you’ve started to notice, a lot of the tasks to transform and explore the data have been reduced to one command, and they are also customizable by providing the respective keyword arguments (see documentation).
# Create a barplot of the mean survival rate grouped by age.
df.barplot(x='Age', y='Survived', method='mean')
# Plots a scatter plot of Age vs. Fare and colours the dots based off the Survived column.
df.scatterplot(x='Age', y='Fare', color='Survived')
# One hot encode the `Person` and `Embarked` columns and then drop the original columns
df.onehot_encode('Person', 'Embarked', drop_col=True)
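Under the hood, one-hot encoding with drop_col=True behaves roughly like pandas get_dummies; a minimal sketch with made-up data:

```python
import pandas as pd

df_ = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One-hot encode the column; get_dummies drops the original column,
# mirroring the drop_col=True behaviour described above.
encoded = pd.get_dummies(df_, columns=['Embarked'])
```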
Modelling¶
Running a Single Model¶
Models can be trained one at a time or multiple at a time. They can be trained by passing in the constructor params for sklearn, xgboost, etc., by passing in a gridsearch dictionary with params, or by cross-validating with gridsearch and params.
After a model has been run, it comes with use cases such as plotting ROC curves, calculating performance metrics, confusion matrices, SHAP plots, decision tree plots and other local and global model interpretability use cases.
lr_model = df.LogisticRegression(random_state=42) # Train a logistic regression model
# Train a logistic regression model with gridsearch
lr_model = df.LogisticRegression(gridsearch={'penalty': ['l1', 'l2']}, random_state=42)
# Cross-validate a logistic regression model; displays the scores and the learning curve and builds the model
lr_model = df.LogisticRegression()
lr_model.cross_validate(cv_type="stratkfold", n_splits=10) # default is stratkfold for classification problems
# Build a Logistic Regression model with Gridsearch and then cross validates the best model using stratified KFold cross validation.
lr_model = df.LogisticRegression(gridsearch={'penalty': ['l1', 'l2']}, cv_type="stratkfold")
lr_model.help_debug() # Interface with items to check for to help debug your model.
lr_model.metrics() # Views all metrics for the model
lr_model.confusion_matrix()
lr_model.decision_boundary()
lr_model.roc_curve()
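The gridsearch and stratified k-fold options mirror what scikit-learn provides directly; a rough sketch of the equivalent plain-sklearn workflow (synthetic data, not aethos internals):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=200, random_state=42)

# Grid search over the penalty, scored with stratified 10-fold cross validation.
grid = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=42),
    param_grid={'penalty': ['l1', 'l2']},
    cv=StratifiedKFold(n_splits=10),
)
grid.fit(X, y)

# The best model found over the grid, refit on all the data.
best_model = grid.best_estimator_
```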
Running Multiple Models¶
# Add a Logistic Regression, Random Forest Classification and a XGBoost Classification model to the queue.
lr = df.LogisticRegression(random_state=42, model_name='log_reg', run=False)
rf = df.RandomForestClassification(run=False)
xgbc = df.XGBoostClassification(run=False)
df.run_models() # This will run all queued models in parallel
df.run_models(method='series') # Run each model one after the other
df.compare_models() # This will display each model evaluated against every metric
# Every model is accessed by a unique name that is assigned when you run the model.
# Default model names can be seen in the function header of each model.
df.log_reg.confusion_matrix() # Displays a confusion matrix for the logistic regression model
df.rf_cls.confusion_matrix() # Displays a confusion matrix for the random forest model
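Comparing queued models against a common metric is conceptually similar to looping over plain scikit-learn estimators; a sketch on synthetic data (not what aethos runs internally):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

models = {
    'log_reg': LogisticRegression(max_iter=1000, random_state=42),
    'rf_cls': RandomForestClassifier(random_state=42),
}

# Score every model with the same metric for an apples-to-apples comparison.
scores = {name: cross_val_score(m, X, y, cv=5, scoring='f1').mean()
          for name, m in models.items()}
```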
Using Pretrained Models¶
Currently you can use pretrained models such as BERT, XLNet, ALBERT, etc. to calculate sentiment and answer questions.
df.pretrained_sentiment_analysis('text_column')
# To answer questions, context for the question has to be supplied
df.pretrained_question_answer('context_column', 'question_column')
Model Interpretability¶
As mentioned in the Modelling section, whenever a model is trained you also have access to model interpretability use cases. There are prebuilt SHAP use cases and an interactive dashboard equipped with LIME and SHAP for local model interpretability and Morris Sensitivity for global model interpretability.
lr_model = df.LogisticRegression(random_state=42)
lr_model.summary_plot() # SHAP summary plot
lr_model.force_plot() # SHAP force plot
lr_model.decision_plot() # SHAP decision plot
lr_model.dependence_plot() # SHAP dependence plot
lr_model.interpret_model() # Creates an interactive dashboard to view LIME, SHAP, Morris Sensitivity and more for your model
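SHAP and LIME require their own packages; as a stand-in illustration of the global-interpretability idea, scikit-learn's permutation importance ranks features by how much shuffling them hurts the score (an analogous technique, not the one aethos calls here):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_informative=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle each feature and measure the drop in score: a bigger drop
# means the model relies on that feature more.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=1)
importances = result.importances_mean
```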
Code Generation¶
Currently you are only able to export your model to be run as a service; the required files are generated automatically. The automatic creation of a data pipeline is still in progress.
lr_model.to_service('titanic')
Now navigate to ~/.aethos/projects/titanic/ (your home folder is ‘~’ on Linux and Users/‘your_user_name’ on Windows) and you will see the files needed to run the model as a service using FastAPI and uvicorn.
Installation¶
pip install aethos
To install the dependencies to use pretrained models such as BERT, XLNet, ALBERT, etc.:
pip install aethos[ptmodels]
To install associating corpora for nltk analysis:
aethos install-corpora
To install and use the extensions such as qgrid for interactive filtering and analysis with DataFrames:
aethos enable-extensions
A conda installation is currently in the works.
To create a Data Science project run:
aethos create
This will create a full folder structure for you to manage data, unit tests, experiments and source code.
If experiment tracking is enabled or if you want to start the MLFlow UI:
aethos mlflow-ui
This will start the MLFlow UI in the directory where your Aethos experiments are run. NOTE: This only works for local use of MLFlow; if you are running MLFlow on a remote server, just start it on the remote server and enter the address in the %HOME%/.aethos/config.yml file.
Configuration¶
By default the configuration file is located at %HOME%/.aethos/config.yml.
You can use the configuration file to specify the full path of where to store reports, images, deployed projects and experiments.
Project Metrics¶
Often in data science projects, you define a metric or metrics to evaluate how well your model performs.
By default when training a model and viewing the results, Aethos calculates all possible metrics for the problem type (Unsupervised, Text, Classification, Regression, etc.).
To change this behaviour it is recommended to set project metrics:
import aethos as at
at.options.project_metrics = ['F1', 'Precision', 'Recall']
Now when comparing models or viewing metrics for models, only the F1 score, precision and recall metrics will be shown and consequently tracked if tracking is enabled.
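Those three project metrics correspond to standard scikit-learn scorers; restricting to them amounts to computing only the following (toy labels for illustration):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Only the configured project metrics are computed and reported.
project_metrics = {
    'F1': f1_score(y_true, y_pred),
    'Precision': precision_score(y_true, y_pred),
    'Recall': recall_score(y_true, y_pred),
}
```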
The supported project metrics are the following:
Classification¶
 Accuracy
 Balanced Accuracy
 Average Precision
 ROC AUC
 Zero One Loss
 Precision
 Recall
 Matthews Correlation Coefficient
 Log Loss
 Jaccard
 Hinge Loss
 Hamming Loss
 FBeta
 F1
 Cohen Kappa
 Brier Loss
 Explained Variance
Regression¶
 Max Error
 Mean Absolute Error
 Mean Squared Error
 Root Mean Squared Error
 Mean Squared Log Error
 Median Absolute Error
 R2
 SMAPE
Using MLFlow¶
To start tracking experiments with MLFlow, enable it by running the following:
import aethos as at
at.options.track_experiments = True
Now any models you train will be tracked with MLFlow against all metrics, unless you set project metrics, in which case MLFlow will only track those project metrics.
at.options.project_metrics = ['F1', 'Precision', 'Recall']
To start the MLFlow UI in the directory your experiments are stored run:
aethos mlflow-ui
Note: This only works for local use of MLFlow; if you are running MLFlow on a remote server, start it on the server and enter the address in the Aethos config file at %HOME%/.aethos/config.yml.
Analysis API¶

class aethos.analysis.Analysis(x_train, x_test=None, target='')¶
Bases: aethos.visualizations.visualizations.Visualizations, aethos.stats.stats.Stats
Core class that runs analytical techniques.
Parameters:  x_train (pd.DataFrame) – Training data or aethos data object
 x_test (pd.DataFrame) – Test data, by default None
 target (str) – For supervised learning problems, the name of the column you’re trying to predict.

autoviz
(max_rows=150000, max_cols=30, verbose=0)¶ Auto visualizes and analyzes your data to help explore your data.
Credits go to AutoViML: https://github.com/AutoViML/AutoViz
Parameters:  max_rows (int, optional) – Max rows to analyze, by default 150000
 max_cols (int, optional) – Max columns to analyze, by default 30
 verbose ({0, 1, 2}, optional) –
0: does not print any messages (silent mode);
1: prints messages on the terminal and also displays charts;
2: prints messages but does not display charts, it simply saves them.

checklist
()¶ Displays a checklist dashboard with reminders for a Data Science project.
Examples
>>> data.checklist()

column_info
(dataset='train')¶ Describes your columns using the DataFrameSummary library with basic descriptive info.
Credits go to @mouradmourafiq for his pandas-summary library.
Info includes: counts, uniques, missing, missing_perc, types.
Parameters: dataset (str, optional) – Type of dataset to describe. Can either be train or test. If you are using the full dataset it will automatically describe your full dataset no matter the input, by default ‘train’
Returns: Dataframe describing your columns with basic descriptive info
Return type: DataFrame
Examples
>>> data.column_info()

columns
¶ Property to return columns in the dataset.

copy
()¶ Returns deep copy of object.
Returns: Deep copy of object
Return type: Object

correlation_matrix
(data_labels=False, hide_mirror=False, output_file='', **kwargs)¶ Plots a correlation matrix of all the numerical variables.
For more information on possible kwargs please see: https://seaborn.pydata.org/generated/seaborn.heatmap.html
Parameters:  data_labels (bool, optional) – True to display the correlation values in the plot, by default False
 hide_mirror (bool, optional) – Whether to display the mirroring half of the correlation plot, by default False
 output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
Examples
>>> data.correlation_matrix(data_labels=True)
>>> data.correlation_matrix(data_labels=True, output_file='corr.png')

data_report
(title='Profile Report', output_file='', suppress=False)¶ Generates a full Exploratory Data Analysis report using Pandas Profiling.
Credits: https://github.com/pandas-profiling/pandas-profiling
For each column the following statistics, if relevant for the column type, are presented in an interactive HTML report:
 Essentials: type, unique values, missing values
 Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
 Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
 Most frequent values
 Histogram
 Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
 Missing values: matrix, count, heatmap and dendrogram of missing values
Parameters:  title (str, optional) – Title of the report, by default ‘Profile Report’
 output_file (str, optional) – File name of the output file for the report, by default ‘’
 suppress (bool, optional) – True if you do not want to display the report, by default False
Returns: HTML display of Exploratory Data Analysis report
Examples
>>> data.data_report()
>>> data.data_report(title='Titanic EDA', output_file='titanic.html')

describe
(dataset='train')¶ Describes your dataset using the DataFrameSummary library with basic descriptive info. Extends the DataFrame.describe() method to give more info.
Credits go to @mouradmourafiq for his pandas-summary library.
Parameters: dataset (str, optional) – Type of dataset to describe. Can either be train or test. If you are using the full dataset it will automatically describe your full dataset no matter the input, by default ‘train’
Returns: Dataframe describing your dataset with basic descriptive info
Return type: DataFrame
Examples
>>> data.describe()

describe_column
(column, dataset='train')¶ Analyzes a column and reports descriptive statistics about the columns.
Credits go to @mouradmourafiq for his pandas-summary library.
Statistics reported: std, max, min, variance, mean, mode, 5%, 25%, 50%, 75%, 95%, iqr, kurtosis, skewness, sum, mad, cv, zeros_num, zeros_perc, deviating_of_mean, deviating_of_mean_perc, deviating_of_median, deviating_of_median_perc, top_correlations, counts, uniques, missing, missing_perc, types.
Parameters:  column (str) – Column in your dataset you want to analze.
 dataset (str, optional) – Type of dataset to describe. Can either be train or test. If you are using the full dataset it will automatically describe your full dataset no matter the input, by default ‘train’
Returns: Dictionary mapping a statistic and its value for a specific column
Return type: dict
Examples
>>> data.describe_column('col1')

drop
(*drop_columns, keep=[], regexp='', reason='')¶ Drops columns from the dataframe.
Parameters:  keep (list: optional) – List of columns to not drop, by default []
 regexp (str, optional) – Regular Expression of columns to drop, by default ‘’
 reason (str, optional) – Reasoning for dropping columns, by default ‘’
 Column names must be provided as strings that exist in the data.
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.drop('A', 'B', reason="Columns were unimportant")
>>> data.drop('col1', keep=['col2'], regexp=r"col*") # Drop all columns that start with "col" except column 2
>>> data.drop(keep=['A']) # Drop all columns except column 'A'
>>> data.drop(regexp=r'col*') # Drop all columns that start with 'col'
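The keep/regexp semantics can be approximated in plain pandas; a sketch with hypothetical column names (not the aethos implementation):

```python
import re

import pandas as pd

df_ = pd.DataFrame(columns=['col1', 'col2', 'A', 'B'])

# Drop every column matching the pattern, except those explicitly kept.
keep = ['col2']
pattern = re.compile(r'col')
to_drop = [c for c in df_.columns if pattern.match(c) and c not in keep]
result = df_.drop(columns=to_drop)
```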

encode_target
()¶ Encodes target variables with values between 0 and n_classes - 1.
Running this function will automatically set the mapping for the target variable, mapping each encoded number to its original value.
Note that this will not work if your test data will have labels that your train data does not.
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.encode_target()
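The encoding described here matches what scikit-learn's LabelEncoder does, including the caveat about unseen test labels; a sketch with made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

y_train = ['no', 'yes', 'yes', 'no']

le = LabelEncoder()
encoded = le.fit_transform(y_train)  # values between 0 and n_classes - 1

# The learned mapping from each encoded number back to its original value.
mapping = dict(enumerate(le.classes_))
```

Calling le.transform on test labels the encoder has never seen raises an error, which is the limitation the note above warns about.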

expand_json_column
(col)¶ Utility function that expands a column that has JSON elements into columns, where each JSON key is a column.
Parameters: col (str) – Column in the data that has the nested data.
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.expand_json_column('col1')
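Expanding a JSON column is essentially pandas json_normalize joined back onto the frame; a sketch with made-up data (not the aethos implementation):

```python
import pandas as pd

df_ = pd.DataFrame({'col1': [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]})

# Each JSON key becomes its own column.
expanded = pd.json_normalize(df_['col1'].tolist())
result = df_.drop(columns=['col1']).join(expanded)
```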

features
¶ Features for modelling

interpret_data
(show=True)¶ Interpret your data using MSFT Interpret dashboard.

missing_values
¶ Property function that shows how many values are missing in each column.

predictive_power
(col=None, data_labels=False, hide_mirror=False, output_file='', **kwargs)¶ Calculates the Predictive Power Score of each feature.
If a column is provided, it will calculate it in regards to the target variable.
Credits go to Florian Wetschorek: https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598
Parameters:  col (str) – Column in the dataframe
 data_labels (bool, optional) – True to display the correlation values in the plot, by default False
 hide_mirror (bool, optional) – Whether to display the mirroring half of the correlation plot, by default False
 output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
Examples
>>> data.predictive_power(data_labels=True)
>>> data.predictive_power(col='col1')

standardize_column_names
()¶ Utility function that standardizes all column names to lowercase and underscores for spaces.
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.standardize_column_names()

to_csv
(name: str, index=False, **kwargs)¶ Write data to csv with the name and path provided.
The function will automatically add ‘.csv’ to the end of the name.
By default it writes 10000 rows at a time to file to accommodate memory constraints on different machines.
Training data will end in ‘_train.csv’ and test data will end in ‘_test.csv’.
For a full list of keyword args for writing to csv please see the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
Parameters:  name (str) – File path
 index (bool, optional) – True to write ‘index’ column, by default False
Examples
>>> data.to_csv('titanic')

to_df
()¶ Return DataFrames for x_train and, if it exists, x_test.
Returns: The training DataFrame; if x_test is provided, returns two DataFrames (train and test).
Return type: DataFrame, *DataFrame
Examples
>>> data.to_df()

y_test
¶ Property function for the testing predictor variable

y_train
¶ Property function for the training predictor variable

class aethos.visualizations.visualizations.Visualizations¶
Bases: object

barplot
(x: str, y=None, method=None, asc=None, orient='v', title='', output_file='', **barplot_kwargs)¶ Plots a bar plot for the given columns provided using Plotly.
If groupby is provided, method must be provided; for example, you may want to plot Age against survival rate, so you would group by Age and then find the mean as the method.
For a list of group by methods please check out the following pandas link: https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats
For a list of possible arguments for the bar plot please check out the following link: https://plot.ly/python-api-reference/generated/plotly.express.bar.html
Parameters:  x (str) – Column name for the x axis.
 y (str, optional) – Column(s) you would like to see plotted against the x_col
 method (str, optional) – Method to aggregate groupby data. Examples: min, max, mean, etc., by default None
 asc (bool) – True to sort values in ascending order, False for descending
 orient (str (default 'v')) – One of ‘h’ for horizontal or ‘v’ for vertical
 title (str) – The figure title.
 color (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array-like object. Values from this column or array-like are used to assign color to marks.
 hover_name (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array-like object. Values from this column or array-like appear in bold in the hover tooltip.
 hover_data (list of str or int, or Series or array-like) – Either names of columns in data_frame, or pandas Series, or array-like objects. Values from these columns appear as extra data in the hover tooltip.
 custom_data (list of str or int, or Series or array-like) – Either names of columns in data_frame, or pandas Series, or array-like objects. Values from these columns are extra data, to be used in widgets or Dash callbacks for example. This data is not user-visible but is included in events emitted by the figure (lasso selection etc.)
 text (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array-like object. Values from this column or array-like appear in the figure as text labels.
 animation_frame (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array-like object. Values from this column or array-like are used to assign marks to animation frames.
 animation_group (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array-like object. Values from this column or array-like are used to provide object-constancy across animation frames: rows with matching `animation_group`s will be treated as if they describe the same object in each frame.
 labels (dict with str keys and str values (default {})) – By default, column names are used in the figure for axis titles, legend entries and hovers. This parameter allows this to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired label to be displayed.
 color_discrete_sequence (list of str) – Strings should define valid CSS colors. When color is set and the values in the corresponding column are not numeric, values in that column are assigned colors by cycling through color_discrete_sequence in the order described in category_orders, unless the value of color is a key in color_discrete_map. Various useful color sequences are available in the plotly.express.colors submodules, specifically plotly.express.colors.qualitative.
 color_discrete_map (dict with str keys and str values (default {})) – String values should define valid CSS colors. Used to override color_discrete_sequence to assign specific colors to marks corresponding with specific values. Keys in color_discrete_map should be values in the column denoted by color.
 color_continuous_scale (list of str) – Strings should define valid CSS colors. This list is used to build a continuous color scale when the column denoted by color contains numeric data. Various useful color scales are available in the plotly.express.colors submodules, specifically plotly.express.colors.sequential, plotly.express.colors.diverging and plotly.express.colors.cyclical.
 opacity (float) – Value between 0 and 1. Sets the opacity for markers.
 barmode (str (default 'relative')) – One of ‘group’, ‘overlay’ or ‘relative’ In ‘relative’ mode, bars are stacked above zero for positive values and below zero for negative values. In ‘overlay’ mode, bars are drawn on top of one another. In ‘group’ mode, bars are placed beside each other.
 width (int (default None)) – The figure width in pixels.
 height (int (default 600)) – The figure height in pixels.
 output_file (str, optional) – Output html file name for image
Returns: Plotly Figure Object of Bar Plot
Return type: Plotly Figure
Examples
>>> data.barplot(x='x', y='y')
>>> data.barplot(x='x', method='mean')
>>> data.barplot(x='x', y='y', method='max', orient='h')

boxplot
(x=None, y=None, color=None, title='', output_file='', **kwargs)¶ Plots a box plot for the given x and y columns.
For more info and kwargs for box plots, see https://plot.ly/python-api-reference/generated/plotly.express.box.html#plotly.express.box and https://plot.ly/python/box-plots/
Parameters:  x (str) – X axis column
 y (str) – y axis column
 color (str, optional) – Column name to add a dimension by color.
 orient (str, optional) – Orientation of graph, ‘h’ for horizontal ‘v’ for vertical, by default ‘v’,
 points (str or bool, one of {'outliers', 'suspectedoutliers', 'all', False}) – If ‘outliers’, only the sample points lying outside the whiskers are shown. If ‘suspectedoutliers’, all outlier points are shown and those less than 4*Q1-3*Q3 or greater than 4*Q3-3*Q1 are highlighted with the marker’s ‘outliercolor’. If ‘all’, all sample points are shown. If False, no sample points are shown and the whiskers extend to the full range of the sample.
 notched (bool, optional) – If True, boxes are drawn with notches, by default False.
 title (str, optional) – Title of the plot, by default “”.
 output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
Returns: Plotly Figure Object of Box Plot
Return type: Plotly Figure
Examples
>>> data.boxplot(y='y', color='z')
>>> data.boxplot(x='x', y='y', color='z', points='all')
>>> data.boxplot(x='x', y='y', output_file='pair.png')

histogram
(*x, hue=None, plot_test=False, output_file='', **kwargs)¶ Plots a histogram of the given column(s).
If no columns are provided, histograms are plotted for all numeric columns
For more histogram key word arguments, please see https://seaborn.pydata.org/generated/seaborn.distplot.html
Parameters:  x (str or str(s)) – Column(s) to plot histograms for.
 hue (str, optional) – Column to colour points by, by default None
 plot_test (bool, optional) – True to plot distribution of the test data for the same variable
 bins (argument for matplotlib hist(), or None, optional) – Specification of hist bins, or None to use the Freedman-Diaconis rule.
 hist (bool, optional) – Whether to plot a (normed) histogram.
 kde (bool, optional) – Whether to plot a gaussian kernel density estimate.
 rug (bool, optional) – Whether to draw a rugplot on the support axis.
 fit (random variable object, optional) – An object with a fit method, returning a tuple that can be passed to a pdf method as positional arguments following a grid of values to evaluate the pdf on.
 output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
Examples
>>> data.histogram()
>>> data.histogram('col1')
>>> data.histogram('col1', 'col2', hue='col3', plot_test=True)
>>> data.histogram('col1', kde=False)
>>> data.histogram('col1', 'col2', hist=False)
>>> data.histogram('col1', kde=False, fit=stat.normal)
>>> data.histogram('col1', kde=False, output_file='hist.png')

jointplot
(x: str, y: str, kind='scatter', output_file='', **kwargs)¶ Plots joint plots of 2 different variables.
Scatter (‘scatter’): Scatter plot and histograms of x and y.
Regression (‘reg’): Scatter plot, with regression line and histograms with kernel density fits.
Residuals (‘resid’): Scatter plot of residuals and histograms of residuals.
Kernel Density Estimates (‘kde’): Density estimate plot and histograms.
Hex (‘hex’): Replaces scatterplot with joint histogram using hexagonal bins and histograms on the axes.
For more info and kwargs for joint plots, see https://seaborn.pydata.org/generated/seaborn.jointplot.html
Parameters:  x (str) – X axis column
 y (str) – y axis column
 kind ({'scatter', 'reg', 'resid', 'kde', 'hex'}, optional) – Kind of plot to draw, by default ‘scatter’
 color (matplotlib color, optional) – Color used for the plot elements.
 dropna (bool, optional) – If True, remove observations that are missing from x and y.
 {x, y}lim (two-tuples, optional) – Axis limits to set before plotting.
 {joint, marginal, annot}_kws (dicts, optional) – Additional keyword arguments for the plot components.
 output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
Examples
>>> data.jointplot(x='x', y='y', kind='kde', color='crimson')
>>> data.jointplot(x='x', y='y', kind='kde', color='crimson', output_file='pair.png')

lineplot
(x: str, y: str, z=None, color=None, title='Line Plot', output_file='', **lineplot_kwargs)¶ Plots a lineplot for the given x and y columns provided using Plotly Express.
For a list of possible lineplot_kwargs please check out the following links:
For 2d:
For 3d:
Parameters:  x (str) – X column name
 y (str) – Column name to plot on the y axis.
 z (str) – Column name to plot on the z axis.
 title (str, optional) – Title of the plot, by default ‘Line Plot’
 color (str) – Category column to draw multiple line plots of
 output_file (str, optional) – Output html file name for image
 text (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in the figure as text labels.
 facet_row (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to facetted subplots in the vertical direction.
 facet_col (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to facetted subplots in the horizontal direction.
 error_x (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars. If error_x_minus is None, error bars will be symmetrical, otherwise error_x is used for the positive direction only.
 error_x_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars in the negative direction. Ignored if error_x is None.
 error_y (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars. If error_y_minus is None, error bars will be symmetrical, otherwise error_y is used for the positive direction only.
 error_y_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars in the negative direction. Ignored if error_y is None.
 animation_frame (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to animation frames.
 animation_group (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to provide object-constancy across animation frames: rows with matching `animation_group`s will be treated as if they describe the same object in each frame.
 labels (dict with str keys and str values (default {})) – By default, column names are used in the figure for axis titles, legend entries and hovers. This parameter allows this to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired label to be displayed.
 color_discrete_sequence (list of str) – Strings should define valid CSS-colors. When color is set and the values in the corresponding column are not numeric, values in that column are assigned colors by cycling through color_discrete_sequence in the order described in category_orders, unless the value of color is a key in color_discrete_map. Various useful color sequences are available in the plotly.express.colors submodules, specifically plotly.express.colors.qualitative.
 color_discrete_map (dict with str keys and str values (default {})) – String values should define valid CSS-colors. Used to override color_discrete_sequence to assign specific colors to marks corresponding with specific values. Keys in color_discrete_map should be values in the column denoted by color.
Returns: Plotly Figure Object of Line Plot
Return type: Plotly Figure
Examples
>>> data.lineplot(x='x', y='y')
>>> data.lineplot(x='x', y='y', output_file='line')

pairplot
(cols=[], kind='scatter', diag_kind='auto', upper_kind=None, lower_kind=None, hue=None, output_file='', **kwargs)¶ Plots pairplots of the variables from the training data.
If hue is not provided and a target variable is set, the data will be separated and highlighted by the classes in that column.
For more info and kwargs on pair plots, please see: https://seaborn.pydata.org/generated/seaborn.pairplot.html
Parameters:  cols (list) – Columns to view pairplot of.
 kind (str {'scatter', 'reg'}, optional) – Type of plot for the off-diagonal plots, by default ‘scatter’
 diag_kind (str {'auto', 'hist', 'kde'}, optional) – Type of plot for diagonal, by default ‘auto’
 upper_kind (str {'scatter', 'kde'}, optional) – Type of plot for upper triangle of pair plot, by default None
 lower_kind (str {'scatter', 'kde'}, optional) – Type of plot for lower triangle of pair plot, by default None
 hue (str, optional) – Column to colour points by, by default None
 {x, y}_vars (lists of variable names, optional) – Variables within data to use separately for the rows and columns of the figure; i.e. to make a non-square plot.
 palette (dict or seaborn color palette) – Set of colors for mapping the hue variable. If a dict, keys should be values in the hue variable.
 output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
Examples
>>> data.pairplot(kind='kde')
>>> data.pairplot(kind='kde', output_file='pair.png')

pieplot
(values: str, names: str, title='', textposition='inside', textinfo='percent', output_file='', **pieplot_kwargs)¶ Plots a Pie plot of a given column.
For more information regarding pie plots please see the following links: https://plot.ly/python/pie-charts/#customizing-a-pie-chart-created-with-px-pie and https://plot.ly/python-api-reference/generated/plotly.express.pie.html#plotly.express.pie.
Parameters:  values (str) – Column in the DataFrame. Values from this column or array_like are used to set values associated to sectors.
 names (str) – Column in the DataFrame. Values from this column or array_like are used as labels for sectors.
 title (str, optional) – The figure title, by default ‘’
 textposition ({'inside', 'outside'}, optional) – Position the text in the plot, by default ‘inside’
 textinfo (str, optional) – Determines which trace information appears on the segments, by default ‘percent’. Can take any of the following values, joined with a ‘+’: ‘label’ displays the label on the segment; ‘text’ displays the text on the segment (this can be set separately to the label); ‘value’ displays the value passed into the trace; ‘percent’ displays the computed percentage.
 output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
 color (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign color to marks.
 color_discrete_sequence (list of str) – Strings should define valid CSS-colors. When color is set and the values in the corresponding column are not numeric, values in that column are assigned colors by cycling through color_discrete_sequence in the order described in category_orders, unless the value of color is a key in color_discrete_map. Various useful color sequences are available in the plotly.express.colors submodules, specifically plotly.express.colors.qualitative.
 color_discrete_map (dict with str keys and str values (default {})) – String values should define valid CSS-colors. Used to override color_discrete_sequence to assign specific colors to marks corresponding with specific values. Keys in color_discrete_map should be values in the column denoted by color.
 hover_name (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in bold in the hover tooltip.
 hover_data (list of str or int, or Series or array-like) – Either names of columns in data_frame, or pandas Series, or array_like objects. Values from these columns appear as extra data in the hover tooltip.
 custom_data (list of str or int, or Series or array-like) – Either names of columns in data_frame, or pandas Series, or array_like objects. Values from these columns are extra data, to be used in widgets or Dash callbacks for example. This data is not user-visible but is included in events emitted by the figure (lasso selection etc.)
 labels (dict with str keys and str values (default {})) – By default, column names are used in the figure for axis titles, legend entries and hovers. This parameter allows this to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired label to be displayed.
 width (int (default None)) – The figure width in pixels.
 height (int (default 600)) – The figure height in pixels.
 opacity (float) – Value between 0 and 1. Sets the opacity for markers.
 hole (float) – Value between 0 and 1. Sets the size of the hole in the middle of the pie chart.
Returns: Plotly Figure Object of Pie Chart
Return type: Plotly Figure
Examples
>>> data.pieplot(val_column, name_column)

plot_colorpalettes
¶ Displays color palette configuration guide.

plot_colors
¶ Displays all plot colour names

plot_dim_reduction
(col: str, dim=2, algo='tsne', output_file='', **kwargs)¶ Reduce the dimensions of your data and then view similarly grouped data points (clusters)
For 2d plotting options, see:
For 3d plotting options, see:
Parameters:  col (str) – Column name of the labels/data points to highlight in the plot
 dim (int {2, 3}) – Dimensions of the plot to show, either 2d or 3d, by default 2
 algo (str {'tsne', 'lle', 'pca', 'tsvd'}, optional) – Algorithm to reduce the dimensions by, by default ‘tsne’
 output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
 kwargs – See plotting options
Returns: Plotly Figure Object of Scatter Plot
Return type: Plotly Figure
Examples
>>> data.plot_dim_reduction('cluster_labels', dim=3)
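The listed algorithms are wrappers around scikit-learn. As an illustration of what the ‘pca’ option computes for a 2d plot, the projection can be sketched with numpy alone (the variable names here are illustrative, not aethos internals):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))  # stand-in for the feature matrix

# Centre the data, then project onto the top-2 right singular
# vectors: these are the coordinates a 2d 'pca' scatter plot shows.
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ vt[:2].T  # shape (100, 2)
```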

raincloud
(x=None, y=None, output_file='', **params)¶ Combines the box plot, scatter plot and split violin plot into one data visualization. This is used to offer eyeballed statistical inference, assessment of data distributions (useful to check assumptions), and the raw data itself showing outliers and underlying patterns.
A raincloud is made of: 1) “Cloud”: a kernel density estimate, the half of a violinplot. 2) “Rain”: a stripplot below the cloud. 3) “Umbrella”: a boxplot. 4) “Thunder”: a pointplot connecting the means of the different categories (if pointplot is True).
https://seaborn.pydata.org/generated/seaborn.boxplot.html
https://seaborn.pydata.org/generated/seaborn.violinplot.html
https://seaborn.pydata.org/generated/seaborn.stripplot.html
Parameters:  x (str) – X axis data, referenced by column name; any data type
 y (str) – Y axis data, referenced by column name; measurable (numeric) data, by default the target column
 hue (Iterable, np.array, or dataframe column name if 'data' is specified) – Second categorical data. Use it to obtain different clouds and rainpoints
 output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
 orient (str) – vertical if “v” (default), horizontal if “h”
 width_viol (float) – width of the cloud
 width_box (float) – width of the boxplot
 palette (list or dict) – Colours to use for the different levels of categorical variables
 bw (str or float) – Either the name of a reference rule or the scale factor to use when computing the kernel bandwidth, by default “scott”
 linewidth (float) – width of the lines
 cut (float) – Distance, in units of bandwidth size, to extend the density past the extreme datapoints. Set to 0 to limit the violin range within the range of the observed data, by default 2
 scale (str) – The method used to scale the width of each violin. If area, each violin will have the same area. If count, the width of the violins will be scaled by the number of observations in that bin. If width, each violin will have the same width. By default “area”
 jitter (float, True/1) – Amount of jitter (only along the categorical axis) to apply. This can be useful when you have many points and they overlap, so that it is easier to see the distribution. You can specify the amount of jitter (half the width of the uniform random variable support), or just use True for a good default.
 move (float) – adjust rain position along the x-axis (default value 0.)
 offset (float) – adjust cloud position along the x-axis
 color (matplotlib color) – Color for all of the elements, or seed for a gradient palette.
 ax (matplotlib axes) – Axes object to draw the plot onto, otherwise uses the current Axes.
 figsize ((int, int)) – size of the visualization, ex (12, 5)
 pointplot (bool) – line that connects the means of all categories, by default False
 dodge (bool) – When hue nesting is used, whether elements should be shifted along the categorical axis.
 Source: https://micahallen.org/2018/03/15/introducing-raincloud-plots/
Examples
>>> data.raincloud('col1')  # Plots col1 values on the x axis and your target variable values on the y axis
>>> data.raincloud('col1', 'col2')  # Plots col1 on the x axis and col2 on the y axis
>>> data.raincloud('col1', 'col2', output_file='raincloud.png')

scatterplot
(x=None, y=None, z=None, color=None, title='Scatter Plot', output_file='', **scatterplot_kwargs)¶ Plots a scatterplot for the given x and y columns provided using Plotly Express.
For a list of possible scatterplot_kwargs for 2 dimensional data please check out the following links:
For more information on key word arguments for 3d data, please check them out here:
Parameters:  x (str) – X column name
 y (str) – Y column name
 z (str) – Z column name,
 color (str, optional) – Category to group your data, by default None
 title (str, optional) – Title of the plot, by default ‘Scatter Plot’
 output_file (str, optional) – Output html file name for image
 symbol (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign symbols to marks.
 size (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign mark sizes.
 hover_name (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in bold in the hover tooltip.
 hover_data (list of str or int, or Series or array-like, or dict) – Either a list of names of columns in data_frame, or pandas Series, or array_like objects, or a dict with column names as keys, with values True (for default formatting), False (in order to remove this column from hover information), or a formatting string, for example ‘:.3f’ or ‘%a’, or list-like data to appear in the hover tooltip, or tuples with a bool or formatting string as first element and list-like data to appear in hover as second element. Values from these columns appear as extra data in the hover tooltip.
 custom_data (list of str or int, or Series or array-like) – Either names of columns in data_frame, or pandas Series, or array_like objects. Values from these columns are extra data, to be used in widgets or Dash callbacks for example. This data is not user-visible but is included in events emitted by the figure (lasso selection etc.)
 text (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in the figure as text labels.
 facet_row (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to facetted subplots in the vertical direction.
 facet_col (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to facetted subplots in the horizontal direction.
 facet_col_wrap (int) – Maximum number of facet columns. Wraps the column variable at this width, so that the column facets span multiple rows. Ignored if 0, and forced to 0 if facet_row or a marginal is set.
 error_x (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars. If error_x_minus is None, error bars will be symmetrical, otherwise error_x is used for the positive direction only.
 error_x_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars in the negative direction. Ignored if error_x is None.
 error_y (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars. If error_y_minus is None, error bars will be symmetrical, otherwise error_y is used for the positive direction only.
 error_y_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars in the negative direction. Ignored if error_y is None.
 animation_frame (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to animation frames.
 animation_group (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to provide object-constancy across animation frames: rows with matching `animation_group`s will be treated as if they describe the same object in each frame.
 labels (dict with str keys and str values (default {})) – By default, column names are used in the figure for axis titles, legend entries and hovers. This parameter allows this to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired label to be displayed.
 color_discrete_sequence (list of str) – Strings should define valid CSS-colors. When color is set and the values in the corresponding column are not numeric, values in that column are assigned colors by cycling through color_discrete_sequence in the order described in category_orders, unless the value of color is a key in color_discrete_map. Various useful color sequences are available in the plotly.express.colors submodules, specifically plotly.express.colors.qualitative.
 color_discrete_map (dict with str keys and str values (default {})) – String values should define valid CSS-colors. Used to override color_discrete_sequence to assign specific colors to marks corresponding with specific values. Keys in color_discrete_map should be values in the column denoted by color.
 color_continuous_scale (list of str) – Strings should define valid CSS-colors. This list is used to build a continuous color scale when the column denoted by color contains numeric data. Various useful color scales are available in the plotly.express.colors submodules, specifically plotly.express.colors.sequential, plotly.express.colors.diverging and plotly.express.colors.cyclical.
 range_color (list of two numbers) – If provided, overrides autoscaling on the continuous color scale.
 color_continuous_midpoint (number (default None)) – If set, computes the bounds of the continuous color scale to have the desired midpoint. Setting this value is recommended when using plotly.express.colors.diverging color scales as the inputs to color_continuous_scale.
 opacity (float) – Value between 0 and 1. Sets the opacity for markers.
 size_max (int (default 20)) – Set the maximum mark size when using size.
 marginal_x (str) – One of ‘rug’, ‘box’, ‘violin’, or ‘histogram’. If set, a horizontal subplot is drawn above the main plot, visualizing the x-distribution.
 marginal_y (str) – One of ‘rug’, ‘box’, ‘violin’, or ‘histogram’. If set, a vertical subplot is drawn to the right of the main plot, visualizing the y-distribution.
 trendline (str) – One of ‘ols’ or ‘lowess’. If ‘ols’, an Ordinary Least Squares regression line will be drawn for each discrete-color/symbol group. If ‘lowess’, a Locally Weighted Scatterplot Smoothing line will be drawn for each discrete-color/symbol group.
 trendline_color_override (str) – Valid CSS color. If provided, and if trendline is set, all trendlines will be drawn in this color.
 log_x (boolean (default False)) – If True, the x-axis is log-scaled in cartesian coordinates.
 log_y (boolean (default False)) – If True, the y-axis is log-scaled in cartesian coordinates.
 range_x (list of two numbers) – If provided, overrides autoscaling on the xaxis in cartesian coordinates.
 range_y (list of two numbers) – If provided, overrides autoscaling on the yaxis in cartesian coordinates.
 width (int (default None)) – The figure width in pixels.
 height (int (default None)) – The figure height in pixels.
Returns: Plotly Figure Object of Scatter Plot
Return type: Plotly Figure
Examples
>>> data.scatterplot(x='x', y='y')  # 2d
>>> data.scatterplot(x='x', y='y', z='z')  # 3d
>>> data.scatterplot(x='x', y='y', z='z', output_file='scatt')

violinplot
(x=None, y=None, color=None, title='', output_file='', **kwargs)¶ Plots a violin plot for the given x and y columns.
For more info and kwargs for violin plots, see https://plot.ly/pythonapireference/generated/plotly.express.violin.html#plotly.express.violin and https://plot.ly/python/violin/
Parameters:  x (str) – X axis column
 y (str) – y axis column
 color (str, optional) – Column name to add a dimension by color.
 orient (str, optional) – Orientation of graph, ‘h’ for horizontal, ‘v’ for vertical, by default ‘v’
 points (str or bool {'outliers', 'suspectedoutliers', 'all', False}) – One of ‘outliers’, ‘suspectedoutliers’, ‘all’, or False. If ‘outliers’, only the sample points lying outside the whiskers are shown. If ‘suspectedoutliers’, all outlier points are shown and those less than 4*Q1-3*Q3 or greater than 4*Q3-3*Q1 are highlighted with the marker’s ‘outliercolor’. If ‘all’, all sample points are shown. If False, no sample points are shown and the whiskers extend to the full range of the sample.
 violinmode (str {'group', 'overlay'}) – In ‘overlay’ mode, violins are drawn on top of one another. In ‘group’ mode, violins are placed beside each other.
 box (bool, optional) – If True, boxes are drawn inside the violins.
 title (str, optional) – Title of the plot, by default “”.
 output_file (str, optional) – Output file name for image with extension (i.e. jpeg, png, etc.)
Returns: Plotly Figure Object of Violin Plot
Return type: Plotly Figure
Examples
>>> data.violinplot(y='y', color='z', box=True)
>>> data.violinplot(x='x', y='y', color='z', points='all')
>>> data.violinplot(x='x', y='y', violinmode='overlay', output_file='pair.png')


class
aethos.cleaning.
Clean
¶ Bases:
object

drop_column_missing_threshold
(threshold: float)¶ Remove columns from the dataframe that have greater than or equal to the threshold value of missing values. Example: Remove columns where >= 50% of the data is missing.
Parameters: threshold (float) – Value between 0 and 1 that describes what percentage of a column can be missing values. Returns: Returns a deep copy of the Data object. Return type: Data Examples
>>> data.drop_column_missing_threshold(0.5)
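In plain pandas the same rule (drop columns whose missing fraction is at or above the threshold) can be sketched as follows; this is an illustration of the behaviour, not aethos's implementation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, np.nan, np.nan],   # 75% missing
                   "b": [1, 2, 3, np.nan]})            # 25% missing

threshold = 0.5
# isna().mean() gives the fraction of missing values per column;
# keep only columns strictly below the threshold.
cleaned = df[df.columns[df.isna().mean() < threshold]]
```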

drop_constant_columns
()¶ Remove columns from the data that only have one unique value.
Returns: Returns a deep copy of the Data object. Return type: Data Examples
>>> data.drop_constant_columns()
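The equivalent operation in plain pandas keeps only columns with more than one distinct value (a sketch of the behaviour, not aethos's code):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [7, 7, 7],        # constant, should be dropped
                   "c": ["x", "y", "x"]})

# nunique(dropna=False) counts distinct values per column,
# treating NaN as its own value.
cleaned = df.loc[:, df.nunique(dropna=False) > 1]
```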

drop_duplicate_columns
()¶ Remove columns from the data that are exact duplicates of each other and leave only 1.
Returns: Returns a deep copy of the Data object. Return type: Data Examples
>>> data.drop_duplicate_columns()

drop_duplicate_rows
(*list_args, list_of_cols=[])¶ Remove rows from the data that are exact duplicates of each other and leave only 1. This can be used to reduce processing time or improve performance for algorithms where duplicates have no effect on the outcome (e.g. DBSCAN).
If a list of columns is provided, use the list; otherwise, use the arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.drop_duplicate_rows('col1', 'col2')  # Only look at columns 1 and 2
>>> data.drop_duplicate_rows(['col1', 'col2'])
>>> data.drop_duplicate_rows()
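pandas exposes the same idea through drop_duplicates, with `subset` playing the role of the column list (an illustrative sketch, not aethos internals):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2],
                   "col2": ["a", "a", "b"],
                   "col3": [0.1, 0.9, 0.5]})

# All columns considered: every row is distinct, nothing dropped.
all_cols = df.drop_duplicates()

# Only col1/col2 considered: the first two rows collapse into one.
subset = df.drop_duplicates(subset=["col1", "col2"])
```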

drop_rows_missing_threshold
(threshold: float)¶ Remove rows from the dataframe that have greater than or equal to the threshold value of missing values. Example: Remove rows where >= 50% of the data is missing.
Parameters: threshold (float) – Value between 0 and 1 that describes what percentage of a row can be missing values. Returns: Returns a deep copy of the Data object. Return type: Data Examples
>>> data.drop_rows_missing_threshold(0.5)

drop_unique_columns
()¶ Remove columns from the data where every value is unique (e.g. ID columns).
Returns: Returns a deep copy of the Data object. Return type: Data Examples
>>> data.drop_unique_columns()

replace_missing_backfill
(*list_args, list_of_cols=[], **extra_kwargs)¶ Replaces missing values in a column with the next known data point.
This is useful when dealing with time series data and you want to replace data in the past with data from the future.
For more info view the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_backfill('col1', 'col2')
>>> data.replace_missing_backfill(['col1', 'col2'])
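As the linked pandas documentation describes, this maps onto a backward fill; a minimal sketch in plain pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 3.0, np.nan, 5.0])

# bfill() propagates the next known observation backwards,
# so leading/interior gaps take the following value.
filled = s.bfill()
```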

replace_missing_constant
(*list_args, list_of_cols=[], constant=0, col_mapping=None)¶ Replaces missing values in every numeric column with a constant.
If no columns are supplied, missing values will be replaced with the constant in every numeric column.
If a list of columns is provided, use the list; otherwise, use the arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
 constant (int or float, optional) – Numeric value to replace all missing values with, by default 0
 col_mapping (dict, optional) – Dictionary mapping {‘ColumnName’: constant}, by default None
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_constant(col_mapping={'a': 1, 'b': 2, 'c': 3})
>>> data.replace_missing_constant('col1', 'col2', constant=2)
>>> data.replace_missing_constant(['col1', 'col2'], constant=3)
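In plain pandas, fillna covers both the single-constant case and the per-column mapping (a sketch of the behaviour, not aethos's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan],
                   "b": [np.nan, 2.0]})

same = df.fillna(0)                     # one constant everywhere
mapped = df.fillna({"a": -1, "b": -2})  # like col_mapping
```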

replace_missing_forwardfill
(*list_args, list_of_cols=[], **extra_kwargs)¶ Replaces missing values in a column with the last known data point.
This is useful when dealing with time series data and you want to replace future missing data with data from the past.
For more info view the following link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_forwardfill('col1', 'col2')
>>> data.replace_missing_forwardfill(['col1', 'col2'])

replace_missing_indicator
(*list_args, list_of_cols=[], missing_indicator=1, valid_indicator=0, keep_col=True)¶ Adds a new column describing whether data is missing for each record in a column.
This is useful if the missing data has meaning, i.e. it is not missing at random.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
 missing_indicator (int, optional) – Value to indicate missing data, by default 1
 valid_indicator (int, optional) – Value to indicate non missing data, by default 0
 keep_col (bool, optional) – True to keep column, False to replace it, by default True
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_indicator('col1', 'col2')
>>> data.replace_missing_indicator(['col1', 'col2'])
>>> data.replace_missing_indicator(['col1', 'col2'], missing_indicator='missing', valid_indicator='not missing', keep_col=False)
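The underlying transformation is a simple flag column built from isna(). A sketch in plain pandas (the `col1_missing` name here is illustrative, not aethos's naming convention):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1.0, np.nan, 3.0]})

# Flag each record: 1 where col1 is missing, 0 otherwise,
# keeping the original column alongside the indicator.
df["col1_missing"] = np.where(df["col1"].isna(), 1, 0)
```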

replace_missing_interpolate
(*list_args, list_of_cols=[], method='linear', **inter_kwargs)¶ Replaces missing values with an interpolation method and possible extrapolation.
The possible interpolation methods are:
 ‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
 ‘time’: Works on daily and higher resolution data to interpolate given length of interval.
 ‘index’, ‘values’: use the actual numerical values of the index.
 ‘pad’: Fill in NaNs using existing values.
 ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d.
 These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method=’polynomial’, order=5).
 ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’: Wrappers around the SciPy interpolation methods of similar names.
 ‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives which replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18.
For more information see: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.interpolate.html or https://docs.scipy.org/doc/scipy/reference/interpolate.html.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
 method (str, optional) – Interpolation method, by default ‘linear’
 limit (int, optional) – Maximum number of consecutive NaNs to fill. Must be greater than 0.
 limit_area ({None, ‘inside’, ‘outside’}, default None) –
If limit is specified, consecutive NaNs will be filled with this restriction.
 None: No fill restriction.
 ‘inside’: Only fill NaNs surrounded by valid values (interpolate).
 ‘outside’: Only fill NaNs outside valid values (extrapolate).
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_interpolate('col1', 'col2')
>>> data.replace_missing_interpolate(['col1', 'col2'])
>>> data.replace_missing_interpolate('col1', 'col2', method='pad', limit=3)
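The method and limit_area options map directly onto pandas' Series.interpolate; for example, linear interpolation restricted to interior gaps:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# 'linear' fills the interior gap evenly (2.0, 3.0);
# limit_area='inside' leaves the trailing NaN untouched,
# because filling it would be extrapolation.
filled = s.interpolate(method="linear", limit_area="inside")
```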

replace_missing_knn
(k=5, **knn_kwargs)¶ Replaces missing data with data from similar records based off a distance metric.
For more info see: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer
Parameters:  missing_values (number, string, np.nan or None, default=`np.nan`) – The placeholder for the missing values. All occurrences of missing_values will be imputed.
 k (int, default=5) – Number of neighboring samples to use for imputation.
 weights ({‘uniform’, ‘distance’} or callable, default=’uniform’) –
Weight function used in prediction. Possible values:
 ‘uniform’ : uniform weights. All points in each neighborhood are weighted equally.
 ‘distance’ : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
 callable : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
 metric ({‘nan_euclidean’} or callable, default=’nan_euclidean’) –
Distance metric for searching neighbors. Possible values:
 ‘nan_euclidean’ : Euclidean distance with support for missing values.
 callable : a user-defined function which conforms to the definition of _pairwise_callable(X, Y, metric, **kwds). The function accepts two arrays, X and Y, and a missing_values keyword in kwds and returns a scalar distance value.
 add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_knn(k=8)
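To illustrate the underlying scikit-learn call (this mirrors the example in the KNNImputer documentation, not the aethos internals):

```python
import numpy as np
from sklearn.impute import KNNImputer

# The NaN is replaced by the mean of its 2 nearest neighbours (rows [3, 4]
# and [8, 8]), measured with nan_euclidean distance on the observed columns.
X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)
# X_filled[2, 0] -> (3.0 + 8.0) / 2 = 5.5
```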

replace_missing_mean
(*list_args, list_of_cols=[])¶ Replaces missing values in every numeric column with the mean of that column.
If no columns are supplied, missing values will be replaced with the mean in every numeric column.
Mean: Average value of the column. Affected by outliers.
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to
 list_of_cols (list, optional) – Specific columns to apply this technique to, by default []
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_mean('col1', 'col2')
>>> data.replace_missing_mean(['col1', 'col2'])
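A plain-pandas sketch of the same idea (not the aethos implementation):

```python
import numpy as np
import pandas as pd

# Fill each numeric column's NaNs with that column's mean.
df = pd.DataFrame({"col1": [1.0, np.nan, 3.0], "col2": [10.0, 20.0, np.nan]})
df = df.fillna(df.mean(numeric_only=True))
```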

replace_missing_median
(*list_args, list_of_cols=[])¶ Replaces missing values in every numeric column with the median of that column.
If no columns are supplied, missing values will be replaced with the median in every numeric column.
Median: Middle value of a list of numbers. Equal to the mean if the data follows a normal distribution. Not affected much by anomalies.
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – Specific columns to apply this technique to, by default []
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_median('col1', 'col2')
>>> data.replace_missing_median(['col1', 'col2'])

replace_missing_mostcommon
(*list_args, list_of_cols=[])¶ Replaces missing values in every numeric column with the most common value of that column.
Mode: Most common value.
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_mostcommon('col1', 'col2')
>>> data.replace_missing_mostcommon(['col1', 'col2'])

replace_missing_new_category
(*list_args, list_of_cols=[], new_category=None, col_mapping=None)¶ Replaces missing values in a categorical column with its own category. The categories can be auto-chosen from the defaults set.
For numeric categorical columns the default values are: -1, -999, -9999. For string categorical columns the default values are: “Other”, “Unknown”, “MissingDataCategory”.
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 new_category (str, int, or float, optional) – Category to replace missing values with, by default None
 col_mapping (dict, optional) – Dictionary mapping {‘ColumnName’: constant}, by default None
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_new_category(col_mapping={'col1': "Green", 'col2': "Canada", 'col3': "December"})
>>> data.replace_missing_new_category('col1', 'col2', 'col3', new_category='Blue')
>>> data.replace_missing_new_category(['col1', 'col2', 'col3'], new_category='Blue')

replace_missing_random_discrete
(*list_args, list_of_cols=[])¶ Replaces missing values with a random value drawn based on the distribution (number of occurrences) of the existing data.
For example, if your data was [5, 5, NaN, 1, 2], there would be a 50% chance that the NaN would be replaced with a 5, a 25% chance for 1 and a 25% chance for 2.
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_random_discrete('col1', 'col2')
>>> data.replace_missing_random_discrete(['col1', 'col2'])
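A hedged numpy/pandas sketch of distribution-weighted sampling; the exact aethos implementation may differ:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series([5.0, 5.0, np.nan, 1.0, 2.0])

# Observed value frequencies: 5.0 -> 0.5, 1.0 -> 0.25, 2.0 -> 0.25.
probs = s.value_counts(normalize=True)

# Draw one replacement per missing entry, weighted by those frequencies.
n_missing = int(s.isna().sum())
draws = rng.choice(probs.index.to_numpy(), size=n_missing, p=probs.to_numpy())
s.loc[s.isna()] = draws
```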

replace_missing_remove_row
(*list_args, list_of_cols=[])¶ Remove rows where the value of a column for those rows is missing.
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.replace_missing_remove_row('col1', 'col2')
>>> data.replace_missing_remove_row(['col1', 'col2'])


class aethos.preprocessing.Preprocess¶
Bases: object

clean_text
(*list_args, list_of_cols=[], lower=True, punctuation=True, stopwords=True, stemmer=True, numbers=True, new_col_name='_clean')¶ Function that takes text and does the following:
 Casts it to lowercase
 Removes punctuation
 Removes stopwords
 Stems the text
 Removes any numerical text
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 lower (bool, optional) – True to cast all text to lowercase, by default True
 punctuation (bool, optional) – True to remove punctuation, by default True
 stopwords (bool, optional) – True to remove stop words, by default True
 stemmer (bool, optional) – True to stem the data, by default True
 numbers (bool, optional) – True to remove numerical data, by default True
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_clean
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.clean_text('col1')
>>> data.clean_text(['col1', 'col2'], lower=False)
>>> data.clean_text(lower=False, stopwords=False, stemmer=False)
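A simplified, dependency-free sketch of these steps (stemming is omitted here; aethos uses NLTK for that, and the tiny stop-word list below is illustrative only):

```python
import re
import string

# Illustrative only: a tiny stop-word list stands in for NLTK's.
STOPWORDS = {"the", "a", "an", "is", "of"}

def clean_text_sketch(text: str) -> str:
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\d+", "", text)                                   # numbers
    return " ".join(w for w in text.split() if w not in STOPWORDS)    # stop words

cleaned = clean_text_sketch("The price of 2 apples is $3!")
# cleaned -> "price apples"
```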

normalize_log
(*list_args, list_of_cols=[], base=1)¶ Scales data logarithmically.
Options are 1 for natural log, 2 for base 2 and 10 for base 10.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 base (int, optional) – Base to logarithmically scale by, by default 1 (natural log)
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.normalize_log('col1')
>>> data.normalize_log(['col1', 'col2'], base=10)
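A numpy sketch of the base options described above (not the aethos implementation):

```python
import numpy as np

# The three supported bases: 1 = natural log, 2 = log2, 10 = log10.
def log_scale(x, base=1):
    x = np.asarray(x, dtype=float)
    if base == 1:
        return np.log(x)
    if base == 2:
        return np.log2(x)
    if base == 10:
        return np.log10(x)
    raise ValueError("base must be 1, 2 or 10")

scaled = log_scale([1.0, 10.0, 100.0], base=10)
```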

normalize_numeric
(*list_args, list_of_cols=[], **normalize_params)¶ Function that normalizes all numeric values between 2 values to bring features into same domain.
If list_of_cols is not provided, the strategy will be applied to all numeric columns.
If a list of columns is provided use the list, otherwise use arguments.
For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 feature_range (tuple(int or float, int or float), optional) – Min and max range to normalize values to, by default (0, 1)
 normalize_params (dict, optional) – Parameters to pass into the MinMaxScaler() constructor from Scikit-Learn
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.normalize_numeric('col1')
>>> data.normalize_numeric(['col1', 'col2'])

normalize_quantile_range
(*list_args, list_of_cols=[], **robust_params)¶ Scale features using statistics that are robust to outliers.
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.
If list_of_cols is not provided, the strategy will be applied to all numeric columns.
If a list of columns is provided use the list, otherwise use arguments.
For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 with_centering (boolean, True by default) – If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
 with_scaling (boolean, True by default) – If True, scale the data to interquartile range.
 quantile_range (tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0) – Default: (25.0, 75.0) = (1st quantile, 3rd quantile) = IQR Quantile range used to calculate scale_.
 robust_params (dict, optional) – Parameters to pass into the RobustScaler() constructor from Scikit-Learn
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.normalize_quantile_range('col1')
>>> data.normalize_quantile_range(['col1', 'col2'])
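The underlying scikit-learn call, sketched directly:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Subtract the median (3.0) and divide by the IQR (4.0 - 2.0 = 2.0);
# the outlier 100.0 does not distort the scaling of the other points.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
scaled = RobustScaler(quantile_range=(25.0, 75.0)).fit_transform(X)
```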

remove_numbers
(*list_args, list_of_cols=[], new_col_name='_rem_num')¶ Removes numbers from text in a column.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_num
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.remove_numbers('col1', new_col_name="text_wo_num")

remove_punctuation
(*list_args, list_of_cols=[], regexp='', exceptions=[], new_col_name='_rem_punct')¶ Removes punctuation from every string entry.
Defaults to removing all punctuation, but if a regex is provided, it defines what to include instead.
An example regex would be:
(\w+\.\w+)[^,] – Include all words and words with periods after them but don’t include commas. (\w+\.)(\w+) would also achieve the same result.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 regexp (str, optional) – Regex expression used to define what to include.
 exceptions (list, optional) – List of punctuation to include in the text, by default []
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_punct
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.remove_punctuation('col1')
>>> data.remove_punctuation(['col1', 'col2'])
>>> data.remove_punctuation('col1', regexp=r'(\w+\.)(\w+)') # Include all words and words with periods after.

remove_stopwords_nltk
(*list_args, list_of_cols=[], custom_stopwords=[], new_col_name='_rem_stop')¶ Removes stopwords following the nltk English stopwords list.
A list of custom words can be provided as well, usually for domain specific words.
Stop words are generally the most common words in a language.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 custom_stopwords (list, optional) – Custom list of words to also drop with the stop words, must be LOWERCASE, by default []
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_stop
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.remove_stopwords_nltk('col1')
>>> data.remove_stopwords_nltk(['col1', 'col2'])

split_sentences
(*list_args, list_of_cols=[], new_col_name='_sentences')¶ Splits text data into sentences and saves it into another column for analysis.
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_sentences
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.split_sentences('col1')
>>> data.split_sentences(['col1', 'col2'])

split_words_nltk
(*list_args, list_of_cols=[], regexp='', new_col_name='_tokenized')¶ Splits text into its words using the NLTK punkt tokenizer by default.
The default is to split by spaces and punctuation, but if a regex expression is provided, it will use that instead.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 regexp (str, optional) – Regex expression used to define what a word is.
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_tokenized
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.split_words_nltk('col1')
>>> data.split_words_nltk(['col1', 'col2'])

stem_nltk
(*list_args, list_of_cols=[], stemmer='porter', new_col_name='_stemmed')¶ Transforms text to their word stem, base or root form. For example:
 dogs –> dog
 churches –> church
 abaci –> abacus
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 stemmer (str, optional) –
Type of NLTK stemmer to use, by default porter
 Current stemming implementations:
 porter
 snowball
For more information please refer to the NLTK stemming api https://www.nltk.org/api/nltk.stem.html
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_stemmed
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.stem_nltk('col1')
>>> data.stem_nltk(['col1', 'col2'], stemmer='snowball')
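The two stemmer options named above, called directly through NLTK (no corpus downloads required):

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

stems = [porter.stem(w) for w in ["dogs", "running"]]
# stems -> ['dog', 'run']
```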


class aethos.feature_engineering.Feature¶
Bases: object

apply
(func, output_col: str)¶ Calls the pandas apply function. Will apply the function to your dataset, or to both your training and testing datasets.
Parameters:  func (Function pointer) – Function describing the transformation for the new column
 output_col (str) – New column name
Returns: Returns a deep copy of the Feature object.
Return type: Data
Examples
>>>    col1  col2  col3
 0        1     0     1
 1        0     2     0
 2        1     0     1
>>> data.apply(lambda x: x['col1'] > 0, 'col4')
>>>    col1  col2  col3  col4
 0        1     0     1     1
 1        0     2     0     0
 2        1     0     1     1

bag_of_words
(*list_args, list_of_cols=[], keep_col=True, **bow_kwargs)¶ Creates a matrix of how many times a word appears in a document.
The premise is that the more times a word appears the more the word represents that document.
For more information see: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 keep_col (bool, optional) – True if you want to keep the column(s) or False if you want to drop the column(s)
 encoding (str, default=’utf-8’) – If bytes or files are given to analyze, this encoding is used to decode.
 decode_error ({‘strict’, ‘ignore’, ‘replace’} (default=’strict’)) – Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
 strip_accents ({‘ascii’, ‘unicode’, None} (default=None)) –
Remove accents and perform other character normalization during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.
Both ‘ascii’ and ‘unicode’ use NFKD normalization from unicodedata.normalize.
 lowercase (bool (default=True)) – Convert all characters to lowercase before tokenizing.
 preprocessor (callable or None (default=None)) – Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable.
 tokenizer (callable or None (default=None)) – Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == ‘word’.
 analyzer (str, {‘word’, ‘char’, ‘char_wb’} or callable) –
Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
 stop_words (str {‘english’}, list, or None (default=None)) –
If a string, it is passed to _check_stop_list and the appropriate stop list is returned. ‘english’ is currently the only supported string value. There are several known issues with ‘english’ and you should consider an alternative (see Using stop words).
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == ‘word’.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
 token_pattern (str) – Regular expression denoting what constitutes a “token”, only used if analyzer == ‘word’. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
 ngram_range (tuple (min_n, max_n), default=(1, 1)) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.
 max_df (float in range [0.0, 1.0] or int (default=1.0)) – When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
 min_df (float in range [0.0, 1.0] or int (default=1)) – When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cutoff in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
 max_features (int or None (default=None)) –
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
 vocabulary (Mapping or iterable, optional (default=None)) – Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.
 binary (bool (default=False)) – If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs).
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.bag_of_words('col1', 'col2', 'col3')
>>> data.bag_of_words('col1', 'col2', 'col3', binary=True)

chi2_feature_selection
(k: int, verbose=False)¶ Uses Chi2 to choose the best K features.
The Chi2 null hypothesis is that 2 variables are independent.
Chi-square test feature selection “weeds out” the features that are most likely to be independent of the class and therefore irrelevant for classification.
Parameters:  k (int or "all") – Number of features to keep.
 verbose (bool) – True to print pvalues for each feature, by default False
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.chi2_feature_selection(k=10)
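A scikit-learn sketch of the same selection (the iris dataset here is purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Keep the 2 features least likely to be independent of the class.
X, y = load_iris(return_X_y=True)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
```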

drop_correlated_features
(threshold=0.95)¶ Drop features that have a correlation coefficient greater than the specified threshold with other features.
Parameters: threshold (float, optional) – Correlation coefficient threshold, by default 0.95
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.drop_correlated_features(threshold=0.9)
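A hedged pandas sketch of one common way to implement this (the exact tie-breaking aethos uses may differ):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with "a"
    "c": [4.0, 1.0, 3.0, 2.0],
})

# Look only at the strict upper triangle so each pair is checked once,
# then drop the later column of any pair above the threshold.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
```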

nounphrases_nltk
(*list_args, list_of_cols=[], new_col_name='_phrases')¶ Extract noun phrases from text using the TextBlob package, which uses the NLTK NLP engine.
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_phrases
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.nounphrases_nltk('col1', 'col2', 'col3')

nounphrases_spacy
(*list_args, list_of_cols=[], new_col_name='_phrases')¶ Extract noun phrases from text using the spaCy NLP engine.
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_phrases
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.nounphrases_spacy('col1', 'col2', 'col3')

onehot_encode
(*list_args, list_of_cols=[], keep_col=True, **onehot_kwargs)¶ Converts categorical columns into binary columns of ones and zeros (a one-hot encoded matrix).
For more info see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 keep_col (bool) – Whether to keep the column being transformed, by default True (keep the column)
 categories (‘auto’ or a list of arraylike, default=’auto’) –
Categories (unique values) per feature:
 ‘auto’ : Determine categories automatically from the training data.
 list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
The used categories can be found in the categories_ attribute.
 drop (‘first’ or an array-like of shape (n_features,), default=None) –
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.
 None : retain all features (the default).
 ‘first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
 array : drop[i] is the category in feature X[:, i] that should be dropped.
 sparse (bool, default=True) – Will return a sparse matrix if set to True, else will return an array.
 dtype (number type, default=np.float) – Desired dtype of output.
 handle_unknown ({‘error’, ‘ignore’}, default=’ignore’) – Whether to raise an error or ignore if an unknown categorical feature is present during transform. When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.onehot_encode('col1', 'col2', 'col3')
>>> data.onehot_encode('col1', 'col2', 'col3', drop='first')
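The underlying scikit-learn call, sketched:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["green"], ["red"]])
enc = OneHotEncoder(handle_unknown="ignore")
X_onehot = enc.fit_transform(X).toarray()
# Categories are sorted, so columns are [green, red]:
# [[0, 1], [1, 0], [0, 1]]
```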

ordinal_encode_labels
(col: str, ordered_cat=[])¶ Encode categorical values with a value between 0 and n_classes-1.
Running this function will automatically set the corresponding mapping for the target variable, mapping each number to the original value.
Note: this will not work if your test data has labels that your train data does not.
Parameters:  col (str, optional) – Column in the data to ordinally encode.
 ordered_cat (list, optional) – A list of ordered categories for the Ordinal encoder. Should be sorted.
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.ordinal_encode_labels('col1')
>>> data.ordinal_encode_labels('col1', ordered_cat=["Low", "Medium", "High"])
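A scikit-learn sketch of ordered-category encoding (OrdinalEncoder here stands in for whatever encoder aethos uses internally):

```python
from sklearn.preprocessing import OrdinalEncoder

# The ordered category list fixes the mapping: Low=0, Medium=1, High=2.
enc = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
codes = enc.fit_transform([["Medium"], ["Low"], ["High"]])
```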

pca
(n_components=10, **pca_kwargs)¶ Reduces the dimensionality of the data using Principal Component Analysis.
Use PCA when the data is dense.
This can be used to reduce complexity as well as speed up computation.
For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
This function exists in featureextraction/util.py
Parameters:  n_components (int, float, None or string, by default 10) –
Number of components to keep.
If n_components is not set, all components are kept.
 If n_components == ‘mle’ and svd_solver == ‘full’, Minka’s MLE is used to guess the dimension. Use of n_components == ‘mle’ will interpret svd_solver == ‘auto’ as svd_solver == ‘full’.
 If 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.
 If svd_solver == ‘arpack’, the number of components must be strictly less than the minimum of n_features and n_samples.
 whiten (bool, optional (default False)) – When True (False by default) the components_ vectors are multiplied by the square root of n_samples and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances. Whitening will remove some information from the transformed signal (the relative variance scales of the components) but can sometimes improve the predictive accuracy of the downstream estimators by making their data respect some hard-wired assumptions.
 svd_solver (string {‘auto’, ‘full’, ‘arpack’, ‘randomized’}) –
 auto :
 the solver is selected by a default policy based on X.shape and n_components: if the input data is larger than 500x500 and the number of components to extract is lower than 80% of the smallest dimension of the data, then the more efficient ‘randomized’ method is enabled. Otherwise the exact full SVD is computed and optionally truncated afterwards.
 full :
 run exact full SVD calling the standard LAPACK solver via scipy.linalg.svd and select the components by postprocessing
 arpack :
 run SVD truncated to n_components calling ARPACK solver via scipy.sparse.linalg.svds. It requires strictly 0 < n_components < min(X.shape)
 randomized :
 run randomized SVD by the method of Halko et al.
 tol (float >= 0, optional (default .0)) – Tolerance for singular values computed by svd_solver == ‘arpack’.
 iterated_power (int >= 0, or ‘auto’, (default ‘auto’)) – Number of iterations for the power method computed by svd_solver == ‘randomized’.
 random_state (int, RandomState instance or None, optional (default None)) – If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when svd_solver == ‘arpack’ or ‘randomized’.
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.pca(n_components=2)
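A direct scikit-learn PCA sketch (random data for illustration only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))       # 100 samples, 5 features
X_reduced = PCA(n_components=2).fit_transform(X)
```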

polynomial_features
(*list_args, list_of_cols=[], **poly_kwargs)¶ Generate polynomial and interaction features.
Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.
For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
 degree (int) – Degree of the polynomial features, by default 2
 interaction_only (boolean, optional) – If true, only interaction features are produced: features that are products of at most degree distinct input features (so not x[1] ** 2, x[0] * x[2] ** 3, etc.), by default False
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.polynomial_features('col1', 'col2', 'col3')
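The expansion above, shown with scikit-learn's PolynomialFeatures directly:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# [a, b] with degree 2 expands to [1, a, b, a^2, ab, b^2].
X = np.array([[2.0, 3.0]])
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
# X_poly -> [[1, 2, 3, 4, 6, 9]]
```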

postag_nltk
(*list_args, list_of_cols=[], new_col_name='_postagged')¶ Tag documents with their respective “Part of Speech” tag with the Textblob package which utilizes the NLTK NLP engine and Penn Treebank tag set. These tags classify a word as a noun, verb, adjective, etc. A full list and their meaning can be found here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_postagged
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.postag_nltk('col1', 'col2', 'col3')

postag_spacy
(*list_args, list_of_cols=[], new_col_name='_postagged')¶ Tag documents with their respective “Part of Speech” tag with the Spacy NLP engine and the Universal Dependencies scheme. These tags classify a word as a noun, verb, adjective, etc. A full list and their meaning can be found here: https://spacy.io/api/annotation#postagging
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_postagged
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.postag_spacy('col1', 'col2', 'col3')

postag_spacy_detailed
(*list_args, list_of_cols=[], new_col_name='_postagged')¶ Tag documents with their respective “Part of Speech” tag with the Spacy NLP engine and the Penn Treebank PoS tags. These tags classify a word as a noun, verb, adjective, etc. A full list and their meaning can be found here: https://spacy.io/api/annotation#postagging
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
 new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_postagged
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.postag_spacy_detailed('col1', 'col2', 'col3')

text_hash
(*list_args, list_of_cols=[], keep_col=True, **hash_kwargs)¶ Creates a matrix of how many times a word appears in a document. The counts can optionally be normalized as token frequencies if norm=’l1’ or projected on the Euclidean unit sphere if norm=’l2’.
The premise is that the more times a word appears the more the word represents that document.
This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.
This strategy has several advantages:
It is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory.
It is fast to pickle and unpickle as it holds no state besides the constructor parameters.
It can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
 keep_col (bool, optional) – True if you want to keep the column(s) or False if you want to drop the column(s)
 n_features (integer, default=(2 ** 20)) – The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.
 hash_kwargs (dict, optional) – Parameters you would pass into the HashingVectorizer constructor, by default {}
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.text_hash('col1', 'col2', 'col3')
>>> data.text_hash('col1', 'col2', 'col3', n_features=50)
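A minimal sketch of the underlying scikit-learn HashingVectorizer, which this method wraps; the documents are illustrative:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat", "the cat sat on the mat"]

# n_features fixes the output width up front; no vocabulary is stored,
# so the vectorizer is stateless and cheap to pickle.
vec = HashingVectorizer(n_features=2 ** 10)
X = vec.fit_transform(docs)
print(X.shape)  # (2, 1024)
```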

tfidf
(*list_args, list_of_cols=[], keep_col=True, **tfidf_kwargs)¶ Creates a matrix of the TF-IDF score for every word in the corpus as it pertains to each document.
The higher the score the more important a word is to a document, the lower the score (relative to the other scores) the less important a word is to a document.
For more information see: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
If a list of columns is provided use the list, otherwise use arguments.
Parameters:  list_args (str(s), optional) – Specific columns to apply this technique to.
 list_of_cols (list, optional) – A list of specific columns to apply this technique to., by default []
 keep_col (bool, optional) – True if you want to keep the column(s) or False if you want to drop the column(s)
 encoding (str, default=’utf-8’) – If bytes or files are given to analyze, this encoding is used to decode.
 decode_error ({‘strict’, ‘ignore’, ‘replace’} (default=’strict’)) – Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
 strip_accents ({‘ascii’, ‘unicode’, None} (default=None)) –
Remove accents and perform other character normalization during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters. None (default) does nothing.
Both ‘ascii’ and ‘unicode’ use NFKD normalization from unicodedata.normalize.
 lowercase (bool (default=True)) – Convert all characters to lowercase before tokenizing.
 preprocessor (callable or None (default=None)) – Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable.
 tokenizer (callable or None (default=None)) – Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == ‘word’.
 analyzer (str, {‘word’, ‘char’, ‘char_wb’} or callable) –
Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
 stop_words (str {‘english’}, list, or None (default=None)) –
If a string, it is passed to _check_stop_list and the appropriate stop list is returned. ‘english’ is currently the only supported string value. There are several known issues with ‘english’ and you should consider an alternative (see Using stop words).
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == ‘word’.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
 token_pattern (str) – Regular expression denoting what constitutes a “token”, only used if analyzer == ‘word’. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
 ngram_range (tuple (min_n, max_n), default=(1, 1)) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.
 max_df (float in range [0.0, 1.0] or int (default=1.0)) – When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
 min_df (float in range [0.0, 1.0] or int (default=1)) – When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cutoff in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
 max_features (int or None (default=None)) –
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
 vocabulary (Mapping or iterable, optional (default=None)) – Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.
 binary (bool (default=False)) – If True, all nonzero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tfidf is binary. (Set idf and normalization to False to get 0/1 outputs).
 dtype (type, optional (default=float64)) – Type of the matrix returned by fit_transform() or transform().
 norm (‘l1’, ‘l2’ or None, optional (default=’l2’)) – Each output row will have unit norm, either: * ‘l2’: Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied. * ‘l1’: Sum of absolute values of vector elements is 1.
 use_idf (bool (default=True)) – Enable inverse-document-frequency reweighting.
 smooth_idf (bool (default=True)) – Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
 sublinear_tf (bool (default=False)) – Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.tfidf('col1', 'col2', 'col3')
>>> data.tfidf('col1', 'col2', 'col3', lowercase=False, smooth_idf=False)
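A minimal sketch of the underlying scikit-learn TfidfVectorizer, which this method wraps; the documents are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Words shared by every document receive a low idf weight;
# rarer words score higher for the documents that contain them.
vec = TfidfVectorizer(lowercase=True, smooth_idf=True)
X = vec.fit_transform(docs)
print(X.shape)  # (3, 12): 3 documents, 12 unique terms
```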

truncated_svd
(n_components=50, **svd_kwargs)¶ Reduces the dimensionality of the data using Truncated SVD.
In particular, truncated SVD works on term count/tf-idf matrices. In that context, it is known as latent semantic analysis (LSA).
Use Truncated SVD when the data is sparse.
This can be used to reduce complexity as well as speed up computation.
For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
This function exists in featureextraction/util.py
Parameters:  n_components (int, default = 2) – Desired dimensionality of output data. Must be strictly less than the number of features. The default value is useful for visualisation. For LSA, a value of 100 is recommended.
 algorithm (string, default = “randomized”) – SVD solver to use. Either “arpack” for the ARPACK wrapper in SciPy (scipy.sparse.linalg.svds), or “randomized” for the randomized algorithm due to Halko (2009).
 n_iter (int, optional (default 5)) – Number of iterations for randomized SVD solver. Not used by ARPACK. The default is larger than the default in ~sklearn.utils.extmath.randomized_svd to handle sparse matrices that may have large slowly decaying spectrum.
 tol (float, optional) – Tolerance for ARPACK. 0 means machine precision. Ignored by randomized SVD solver.
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.truncated_svd(n_components=2)
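The LSA use case can be sketched with scikit-learn directly, which this method wraps: TF-IDF followed by truncated SVD on the resulting sparse matrix. The documents are illustrative:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "machine learning with sparse text data",
    "latent semantic analysis of text",
    "truncated svd reduces dimensionality",
    "svd works directly on sparse matrices",
]

# Unlike PCA, TruncatedSVD does not center the data,
# so it operates on sparse matrices efficiently.
X = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, algorithm="randomized", random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (4, 2)
```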


class
aethos.stats.stats.
Stats
¶ Bases:
object

anova
(dep_var: str, num_variables=[], cat_variables=[], formula=None, verbose=False)¶ Runs an anova.
Anovas are to be used when one wants to compare the means of a condition between 2+ groups.
ANOVA tests if there is a difference in the mean somewhere in the model (testing if there was an overall effect), but it does not tell one where the difference is if there is one.
Parameters:  dep_var (str) – Dependent variable you want to explore the relationship of
 num_variables (list, optional) – Numeric variable columns, by default []
 cat_variables (list, optional) – Categorical variable columns, by default []
 formula (str, optional) – OLS formula statsmodel lib, by default None
 verbose (bool, optional) – True to print OLS model summary and formula, by default False
Examples
>>> data.anova('dep_col', num_variables=['col1', 'col2'], verbose=True)
>>> data.anova('dep_col', cat_variables=['col1', 'col2'], verbose=True)
>>> data.anova('dep_col', num_variables=['col1', 'col2'], cat_variables=['col3'], verbose=True)
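aethos builds the ANOVA from a statsmodels OLS formula, but the core idea can be sketched more simply with scipy’s one-way ANOVA; the group data below is made up:

```python
from scipy import stats

# Three groups; ANOVA asks whether at least one group mean differs,
# without saying which one.
group_a = [23, 25, 21, 22, 24]
group_b = [30, 31, 29, 32, 28]
group_c = [23, 24, 22, 25, 21]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # small p-value: some group mean differs
```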

ind_ttest
(group1: str, group2: str, equal_var=True, output_file=None)¶ Performs an independent t-test.
This is to be used when you want to compare the means of 2 groups.
If group 2 column name is not provided and there is a test set, it will compare the same column in the train and test set.
If there are any NaN’s they will be omitted.
Parameters:  group1 (str) – Column for group 1 to compare.
 group2 (str, optional) – Column for group 2 to compare, by default None
 equal_var (bool, optional) – If True (default), perform a standard independent 2-sample test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variance, by default True
 output_file (str, optional) – Name of the file to output, by default None
Returns: t-test statistic, p-value
Return type: list
Examples
>>> data.ind_ttest('col1', 'col2')
>>> data.ind_ttest('col1', 'col2', output_file='ind_ttest.png')
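The equal_var switch maps onto scipy’s independent t-test, sketched here with made-up samples:

```python
from scipy import stats

sample1 = [2.1, 2.5, 2.3, 2.7, 2.4, 2.6]
sample2 = [3.1, 3.4, 3.2, 3.6, 3.3, 3.5]

# equal_var=True is the standard two-sample t-test;
# equal_var=False would give Welch's t-test instead.
t_stat, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)
print(t_stat, p_value)  # small p-value: the group means differ
```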

ks_feature_distribution
(threshold=0.1, show_plots=True)¶ Uses the Kolmogorov-Smirnov test to see if the distributions in the training and test sets are similar.
Parameters:  threshold (float, optional) – KS statistic threshold, by default 0.1
 show_plots (bool, optional) – True to show histograms of feature distributions, by default True
Returns: Columns that are significantly different in the train and test set.
Return type: DataFrame
Examples
>>> data.ks_feature_distribution()
>>> data.ks_feature_distribution(threshold=0.2)
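The per-feature check corresponds to scipy’s two-sample KS test; a sketch with synthetic data where the test distribution is deliberately shifted:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
train_feature = rng.normal(0.0, 1.0, 1000)
test_feature = rng.normal(0.5, 1.0, 1000)  # shifted distribution

# A KS statistic above the chosen threshold (e.g. 0.1) flags the feature
# as distributed differently in train and test.
ks_stat, p_value = stats.ks_2samp(train_feature, test_feature)
print(ks_stat > 0.1)  # True
```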

most_common
(col: str, n=15, plot=False, use_test=False, output_file='', **plot_kwargs)¶ Analyzes the most common values in the column and either prints them or displays a bar chart.
Parameters:  col (str) – Column to analyze
 n (int, optional) – Number of top most common values to display, by default 15
 plot (bool, optional) – True to plot a bar chart, by default False
 use_test (bool, optional) – True to analyze the test set, by default False
 output_file (str, optional) – File name to save the plot as, if plot=True
Examples
>>> data.most_common('col1', plot=True)
>>> data.most_common('col1', n=50, plot=True)
>>> data.most_common('col1', n=50)
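Without plotting, the analysis amounts to a pandas value count; a sketch with made-up data:

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# The top-n most frequent values, roughly what most_common(n=2) reports.
top = s.value_counts().head(2)
print(top.index.tolist())  # ['red', 'blue']
```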

onesample_ttest
(group1: str, mean: Union[float, int], output_file=None)¶ Performs a one-sample t-test.
This is to be used when you want to compare the mean of a single group against a known mean.
If there are any NaN’s they will be omitted.
Parameters:  group1 (str) – Column for group 1 to compare.
 mean (float, int, optional) – Sample mean to compare to.
 output_file (str, optional) – Name of the file to output, by default None
Returns: t-test statistic, p-value
Return type: list
Examples
>>> data.onesample_ttest('col1', 1)
>>> data.onesample_ttest('col1', 1, output_file='ones_ttest.png')

paired_ttest
(group1: str, group2=None, output_file=None)¶ Performs a paired t-test.
This is to be used when you want to compare the means from the same group at different times.
If group 2 column name is not provided and there is a test set, it will compare the same column in the train and test set.
If there are any NaN’s they will be omitted.
Parameters:  group1 (str) – Column for group 1 to compare.
 group2 (str, optional) – Column for group 2 to compare, by default None
 output_file (str, optional) – Name of the file to output, by default None
Returns: t-test statistic, p-value
Return type: list
Examples
>>> data.paired_ttest('col1', 'col2')
>>> data.paired_ttest('col1', 'col2', output_file='pair_ttest.png')
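The “same group at different times” setup maps onto scipy’s paired t-test; a sketch with made-up before/after measurements:

```python
from scipy import stats

# Same subjects measured before and after some treatment.
before = [85, 90, 78, 92, 88, 76]
after = [88, 93, 82, 95, 91, 80]

t_stat, p_value = stats.ttest_rel(before, after)
print(t_stat, p_value)  # small p-value: consistent within-subject change
```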

predict_data_sample
()¶ Identifies how similar the train and test set distributions are by trying to predict whether each sample belongs to the train or test set using a Random Forest with 10-fold stratified cross-validation.
The lower the F1 score, the more similar the distributions are as it’s harder to predict which sample belongs to which distribution.
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.predict_data_sample()
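One way the described procedure could look, sketched with scikit-learn on synthetic data where train and test are drawn from the same distribution (so the classifier should do poorly):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.RandomState(0)
x_train = rng.normal(0, 1, (200, 5))
x_test = rng.normal(0, 1, (100, 5))  # same distribution as x_train

# Label each row by origin and try to predict it; near-chance F1 means
# the two distributions are hard to tell apart.
X = np.vstack([x_train, x_test])
y = np.array([0] * len(x_train) + [1] * len(x_test))

clf = RandomForestClassifier(n_estimators=50, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
print(scores.mean())
```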

Model API¶

class
aethos.modelling.model.
ModelBase
(x_train, target, x_test=None, test_split_percentage=0.2, exp_name='myexperiment')¶ Bases:
object

Doc2Vec
(col_name, prep=False, model_name='d2v', run=True, **kwargs)¶ The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model. For instance: “dog”, “puppy” and “pup” are often used in similar situations, with similar surrounding words like “good”, “fluffy” or “cute”, and according to Word2Vec they will therefore share a similar vector representation.
From this assumption, Word2Vec can be used to find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for other applications such as text classification or clustering.
For more information on word2vec, you can view it here https://radimrehurek.com/gensim/models/word2vec.html.
Parameters:  col_name (str, optional) – Column name of text data that you want to summarize
 prep (bool, optional) – True to prep the data. Use when passing in raw text data. False if passing in text that is already prepped. By default False
 model_name (str, optional) – Name for this model, by default d2v
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 dm ({1,0}, optional) – Defines the training algorithm. If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.
 vector_size (int, optional) – Dimensionality of the feature vectors.
 window (int, optional) – The maximum distance between the current and predicted word within a sentence.
 alpha (float, optional) – The initial learning rate.
 min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.
 min_count (int, optional) – Ignores all words with total frequency lower than this.
 max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
 sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
 workers (int, optional) – Use these many worker threads to train the model (=faster training with multicore machines).
 epochs (int, optional) – Number of iterations (epochs) over the corpus.
 hs ({1,0}, optional) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is nonzero, negative sampling will be used.
 negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
 ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.
 dm_mean ({1,0}, optional) – If 0 , use the sum of the context word vectors. If 1, use the mean. Only applies when dm is used in nonconcatenative mode.
 dm_concat ({1,0}, optional) – If 1, use concatenation of context vectors rather than sum/average; Note concatenation results in a much larger model, as the input is no longer the size of one (sampled or arithmetically combined) word vector, but the size of the tag(s) and all words in the context strung together.
 dm_tag_count (int, optional) – Expected constant number of document tags per document, when using dm_concat mode.
 dbow_words ({1,0}, optional) – If set to 1 trains word-vectors (in skip-gram fashion) simultaneous with DBOW doc-vector training; If 0, only trains doc-vectors (faster).
 trim_rule (function, optional) –
Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during current method call and is not stored as part of the model.
The input parameters are of the following types:
word (str) – the word we are examining
count (int) – the word’s frequency count in the corpus
min_count (int) – the minimum count threshold.
Returns: Resulting model
Return type:
Examples
>>> model.Doc2Vec('col1', prep=True)
>>> model.Doc2Vec('col1', run=False) # Add model to the queue

LDA
(col_name, prep=False, model_name='lda', run=True, **kwargs)¶ Extracts topics from your data using Latent Dirichlet Allocation.
For more information on LDA, you can view it here https://radimrehurek.com/gensim/models/ldamodel.html.
Parameters:  col_name (str, optional) – Column name of text data that you want to summarize
 prep (bool, optional) – True to prep the data. Use when passing in raw text data. False if passing in text that is already prepped. By default False
 model_name (str, optional) – Name for this model, default to lda
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 num_topics ((int, optional)) – The number of requested latent topics to be extracted from the training corpus.
 distributed ((bool, optional)) – Whether distributed computing should be used to accelerate training.
 chunksize ((int, optional)) – Number of documents to be used in each training chunk.
 passes ((int, optional)) – Number of passes through the corpus during training.
 update_every ((int, optional)) – Number of documents to be iterated through for each update. Set to 0 for batch learning, > 1 for online iterative learning.
 alpha (({numpy.ndarray, str}, optional)) –
Can be set to a 1-D array of length equal to the number of expected topics that expresses our a-priori belief for each topic’s probability. Alternatively default prior selecting strategies can be employed by supplying a string:
’asymmetric’: Uses a fixed normalized asymmetric prior of 1.0 / topicno.
’auto’: Learns an asymmetric prior from the corpus (not available if distributed==True).
 eta (({float, np.array, str}, optional)) –
A-priori belief on word probability; this can be:
a scalar for a symmetric prior over topic/word probability,
a vector of length num_words to denote an asymmetric user defined probability for each word,
a matrix of shape (num_topics, num_words) to assign a probability for each word-topic combination,
the string ‘auto’ to learn the asymmetric prior from the data.
 decay ((float, optional)) – A number between (0.5, 1] to weight what percentage of the previous lambda value is forgotten when each new document is examined. Corresponds to Kappa from Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation NIPS’10”.
 offset ((float, optional)) – Hyperparameter that controls how much we will slow down the first steps the first few iterations. Corresponds to Tau_0 from Matthew D. Hoffman, David M. Blei, Francis Bach: “Online Learning for Latent Dirichlet Allocation NIPS’10”.
 eval_every ((int, optional)) – Log perplexity is estimated every that many updates. Setting this to one slows down training by ~2x.
 iterations ((int, optional)) – Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.
 gamma_threshold ((float, optional)) – Minimum change in the value of the gamma parameters to continue iterating.
 minimum_probability ((float, optional)) – Topics with a probability lower than this threshold will be filtered out.
 random_state (({np.random.RandomState, int}, optional)) – Either a randomState object or a seed to generate one. Useful for reproducibility.
 ns_conf ((dict of (str, object), optional)) – Keyword parameters propagated to gensim.utils.getNS() to get a Pyro4 nameserver. Only used if distributed is set to True.
 minimum_phi_value ((float, optional)) – If per_word_topics is True, this represents a lower bound on the term probabilities.
 per_word_topics (bool) – If True, the model also computes a list of topics, sorted in descending order of most likely topics for each word, along with their phi values multiplied by the feature length (i.e. word count).
Returns: Resulting model
Return type:
Examples
>>> model.LDA('col1', prep=True)
>>> model.LDA('col1', run=False) # Add model to the queue

Word2Vec
(col_name, prep=False, model_name='w2v', run=True, **kwargs)¶ The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model. For instance: “dog”, “puppy” and “pup” are often used in similar situations, with similar surrounding words like “good”, “fluffy” or “cute”, and according to Word2Vec they will therefore share a similar vector representation.
From this assumption, Word2Vec can be used to find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for other applications such as text classification or clustering.
For more information on word2vec, you can view it here https://radimrehurek.com/gensim/models/word2vec.html.
Parameters:  col_name (str, optional) – Column name of text data that you want to summarize
 prep (bool, optional) – True to prep the data. Use when passing in raw text data. False if passing in text that is already prepped. By default False
 model_name (str, optional) – Name for this model, by default w2v
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 size (int, optional) – Dimensionality of the word vectors.
 window (int, optional) – Maximum distance between the current and predicted word within a sentence.
 min_count (int, optional) – Ignores all words with total frequency lower than this.
 sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.
 hs ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is nonzero, negative sampling will be used.
 negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
 ns_exponent (float, optional) – The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212, Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.
 cbow_mean ({0, 1}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean; only applies when CBOW is used.
 alpha (float, optional) – The initial learning rate.
 min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.
 max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
 max_final_vocab (int, optional) – Limits the vocab to a target vocab size by automatically picking a matching min_count. If the specified min_count is more than the calculated min_count, the specified min_count will be used. Set to None if not required.
 sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
 hashfxn (function, optional) – Hash function to use to randomly initialize weights, for increased training reproducibility.
 workers (int, optional) – Use these many worker threads to train the model (=faster training with multicore machines).
 iter (int, optional) – Number of iterations (epochs) over the corpus.
 trim_rule (function, optional) –
Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
The input parameters are of the following types:
word (str) – the word we are examining
count (int) – the word’s frequency count in the corpus
min_count (int) – the minimum count threshold.
 sorted_vocab ({0, 1}, optional) – If 1, sort the vocabulary by descending frequency before assigning word indexes. See sort_vocab().
 batch_words (int, optional) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines). (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
 compute_loss (bool, optional) – If True, computes and stores loss value which can be retrieved using get_latest_training_loss().
Returns: Resulting model
Return type:
Examples
>>> model.Word2Vec('col1', prep=True)
>>> model.Word2Vec('col1', run=False) # Add model to the queue

columns
¶ Property to return columns in the dataset.

compare_models
()¶ Compare different models across every known metric for that model.
Returns: Dataframe of every model and metrics associated for that model
Return type: Dataframe
Examples
>>> model.compare_models()

copy
()¶ Returns deep copy of object.
Returns: Deep copy of object
Return type: Object

delete_model
(name)¶ Deletes a model, specified by its name, which can be viewed by calling list_models.
Will look in both queued and ran models and delete where it’s found.
Parameters: name (str) – Name of the model
Examples
>>> model.delete_model('model1')

extract_keywords_gensim
(*list_args, list_of_cols=[], new_col_name='_extracted_keywords', model_name='model_extracted_keywords_gensim', run=True, **keyword_kwargs)¶ Extracts keywords using Gensim’s implementation of the TextRank algorithm.
Get most ranked words of provided text and/or its combinations.
Parameters:  list_of_cols (list, optional) – Column name(s) of text data that you want to summarize
 new_col_name (str, optional) – New column name to be created when applying this technique, by default _extracted_keywords
 model_name (str, optional) – Name for this model, by default model_extracted_keywords_gensim
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 ratio (float, optional) – Number between 0 and 1 that determines the proportion of the number of sentences of the original text to be chosen for the summary.
 words (int, optional) – Number of returned words.
 split (bool, optional) – If True, list of sentences will be returned. Otherwise joined strings will be returned.
 scores (bool, optional) – Whether to also return each keyword’s score.
 pos_filter (tuple, optional) – Part of speech filters.
 lemmatize (bool, optional) – If True  lemmatize words.
 deacc (bool, optional) – If True  remove accentuation.
Returns: Resulting model
Return type: Examples
>>> model.extract_keywords_gensim('col1')
>>> model.extract_keywords_gensim('col1', run=False) # Add model to the queue

features
¶ Features for modelling

help_debug
()¶ Displays tips to help debug model outputs and to deal with over- and underfitting.
Credit: Andrew Ng and his book Machine Learning Yearning
Examples
>>> model.help_debug()

list_models
()¶ Prints out all queued and ran models.
Examples
>>> model.list_models()

pretrained_question_answer
(context_col: str, question_col: str, model_type=None, new_col_name='qa', run=True)¶ Uses Huggingface’s pipeline to automatically run Q&A analysis on text.
The default model is ‘tf_distil_bert_for_question_answering_2’
Possible model types are:
 bert-base-uncased
 bert-large-uncased
 bert-base-cased
 bert-large-cased
 bert-base-multilingual-uncased
 bert-base-multilingual-cased
 bert-base-chinese
 bert-base-german-cased
 bert-large-uncased-whole-word-masking
 bert-large-cased-whole-word-masking
 bert-large-uncased-whole-word-masking-finetuned-squad
 bert-large-cased-whole-word-masking-finetuned-squad
 bert-base-cased-finetuned-mrpc
 bert-base-german-dbmdz-cased
 bert-base-german-dbmdz-uncased
 bert-base-japanese
 bert-base-japanese-whole-word-masking
 bert-base-japanese-char
 bert-base-japanese-char-whole-word-masking
 bert-base-finnish-cased-v1
 bert-base-finnish-uncased-v1
 openai-gpt
 gpt2
 gpt2-medium
 gpt2-large
 gpt2-xl
 transfo-xl-wt103
 xlnet-base-cased
 xlnet-large-cased
 xlm-mlm-en-2048
 xlm-mlm-ende-1024
 xlm-mlm-enfr-1024
 xlm-mlm-enro-1024
 xlm-mlm-xnli15-1024
 xlm-mlm-tlm-xnli15-1024
 xlm-clm-enfr-1024
 xlm-clm-ende-1024
 xlm-mlm-17-1280
 xlm-mlm-100-1280
 roberta-base
 roberta-large
 roberta-large-mnli
 distilroberta-base
 roberta-base-openai-detector
 roberta-large-openai-detector
 distilbert-base-uncased
 distilbert-base-uncased-distilled-squad
 distilgpt2
 distilbert-base-german-cased
 distilbert-base-multilingual-cased
 ctrl
 camembert-base
 albert-base-v1
 albert-large-v1
 albert-xlarge-v1
 albert-xxlarge-v1
 albert-base-v2
 albert-large-v2
 albert-xlarge-v2
 albert-xxlarge-v2
 t5-small
 t5-base
 t5-large
 t5-3B
 t5-11B
 xlm-roberta-base
 xlm-roberta-large
Parameters:  context_col (str) – Column name that contains the context for the question
 question_col (str) – Column name of the question
 model_type (str, optional) – Type of model, by default None
 new_col_name (str, optional) – New column name for the answers, by default “qa”
Returns: Resulting model
Return type: TF or PyTorch model
Examples
>>> m.pretrained_question_answer('col1', 'col2')
>>> m.pretrained_question_answer('col1', 'col2', model_type='albert-base-v1')

pretrained_sentiment_analysis
(col: str, model_type=None, new_col_name='sent_score', run=True)¶ Uses Huggingface’s pipeline to automatically run sentiment analysis on text.
The default model is ‘tf_distil_bert_for_sequence_classification_2’
Possible model types are:
 bert-base-uncased
 bert-large-uncased
 bert-base-cased
 bert-large-cased
 bert-base-multilingual-uncased
 bert-base-multilingual-cased
 bert-base-chinese
 bert-base-german-cased
 bert-large-uncased-whole-word-masking
 bert-large-cased-whole-word-masking
 bert-large-uncased-whole-word-masking-finetuned-squad
 bert-large-cased-whole-word-masking-finetuned-squad
 bert-base-cased-finetuned-mrpc
 bert-base-german-dbmdz-cased
 bert-base-german-dbmdz-uncased
 bert-base-japanese
 bert-base-japanese-whole-word-masking
 bert-base-japanese-char
 bert-base-japanese-char-whole-word-masking
 bert-base-finnish-cased-v1
 bert-base-finnish-uncased-v1
 openai-gpt
 gpt2
 gpt2-medium
 gpt2-large
 gpt2-xl
 transfo-xl-wt103
 xlnet-base-cased
 xlnet-large-cased
 xlm-mlm-en-2048
 xlm-mlm-ende-1024
 xlm-mlm-enfr-1024
 xlm-mlm-enro-1024
 xlm-mlm-xnli15-1024
 xlm-mlm-tlm-xnli15-1024
 xlm-clm-enfr-1024
 xlm-clm-ende-1024
 xlm-mlm-17-1280
 xlm-mlm-100-1280
 roberta-base
 roberta-large
 roberta-large-mnli
 distilroberta-base
 roberta-base-openai-detector
 roberta-large-openai-detector
 distilbert-base-uncased
 distilbert-base-uncased-distilled-squad
 distilgpt2
 distilbert-base-german-cased
 distilbert-base-multilingual-cased
 ctrl
 camembert-base
 albert-base-v1
 albert-large-v1
 albert-xlarge-v1
 albert-xxlarge-v1
 albert-base-v2
 albert-large-v2
 albert-xlarge-v2
 albert-xxlarge-v2
 t5-small
 t5-base
 t5-large
 t5-3B
 t5-11B
 xlm-roberta-base
 xlm-roberta-large
Parameters:  col (str) – Column of text to get sentiment analysis
 model_type (str, optional) – Type of model, by default None
 new_col_name (str, optional) – New column name for the sentiment scores, by default “sent_score”
Returns: Resulting model
Return type: TF or PyTorch model
Examples
>>> m.pretrained_sentiment_analysis('col1')
>>> m.pretrained_sentiment_analysis('col1', model_type='albert-base-v1')

run_models
(method='parallel')¶ Runs all queued models.
The models can either be run one after the other (‘series’) or at the same time in parallel.
Parameters: method (str, optional) – How to run models, can either be in ‘series’ or in ‘parallel’, by default ‘parallel’
Examples
>>> model.run_models()
>>> model.run_models(method='series')
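The series/parallel distinction can be illustrated with a stand-alone sketch. The queue of callables below is hypothetical, not aethos’s internal queue representation:

```python
from concurrent.futures import ThreadPoolExecutor

def run_queued(models, method="parallel"):
    """Run a queue of model-training callables in series or in parallel."""
    if method == "series":
        return [train() for train in models]   # one after the other
    with ThreadPoolExecutor() as pool:         # all at the same time
        futures = [pool.submit(train) for train in models]
        return [f.result() for f in futures]

# Hypothetical queued "models": callables that return a trained-model name.
queued = [lambda: "log_reg", lambda: "rf", lambda: "xgb"]
results = run_queued(queued, method="series")
```

Collecting the futures in submission order keeps results aligned with the queue regardless of which model finishes first.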

summarize_gensim
(*list_args, list_of_cols=[], new_col_name='_summarized', model_name='model_summarize_gensim', run=True, **summarizer_kwargs)¶ Summarize bodies of text using Gensim’s Text Rank algorithm. Note that it uses a Text Rank variant as stated here: https://radimrehurek.com/gensim/summarization/summariser.html
The output summary will consist of the most representative sentences and will be returned as a string, divided by newlines.
Parameters:  list_of_cols (list, optional) – Column name(s) of text data that you want to summarize
 new_col_name (str, optional) – New column name to be created when applying this technique, by default _summarized
 model_name (str, optional) – Name for this model, by default model_summarize_gensim
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 ratio (float, optional) – Number between 0 and 1 that determines the proportion of the number of sentences of the original text to be chosen for the summary.
 word_count (int or None, optional) – Determines how many words the output will contain. If both parameters are provided, the ratio will be ignored.
 split (bool, optional) – If True, list of sentences will be returned. Otherwise joined strings will be returned.
Returns: Resulting model
Return type: Examples
>>> model.summarize_gensim('col1')
>>> model.summarize_gensim('col1', run=False) # Add model to the queue
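The ratio/word_count precedence described in the parameters above can be made concrete with a small sketch. pick_summary_length is a hypothetical helper, not part of aethos or gensim:

```python
def pick_summary_length(n_sentences, ratio=0.2, word_count=None):
    """If word_count is given, ratio is ignored; otherwise take a
    proportion of the original sentences (at least one)."""
    if word_count is not None:
        return ("words", word_count)       # word_count wins over ratio
    return ("sentences", max(1, round(n_sentences * ratio)))
```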

test_data
¶ Testing data used to evaluate models

to_pickle
(name: str)¶ Writes model to a pickle file.
Parameters: name (str) – Name of the model
Examples
>>> m = Model(df)
>>> m.LogisticRegression()
>>> m.to_pickle('log_reg')
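Assuming the file written by to_pickle is an ordinary pickle, the saved model can presumably be restored with the standard library. The object below is a stand-in, and the directory aethos actually writes to may differ:

```python
import os
import pickle
import tempfile

model = {"name": "log_reg", "coef": [0.4, -1.2]}   # stand-in model object

# Write and read back the same way to_pickle presumably does.
path = os.path.join(tempfile.gettempdir(), "log_reg.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)
```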

to_service
(model_name: str, project_name: str)¶ Creates an app.py, requirements.txt and Dockerfile in ~/.aethos/projects and the necessary folder structure to run the model as a microservice.
Parameters:  model_name (str) – Name of the model to create a microservice of.
 project_name (str) – Name of the project that you want to create.
Examples
>>> m = Model(df)
>>> m.LogisticRegression()
>>> m.to_service('log_reg', 'your_proj_name')

train_data
¶ Training data used for modelling

y_test
¶ Property function for the testing predictor variable


class aethos.modelling.classification_models.Classification(x_train, target, x_test=None, test_split_percentage=0.2, exp_name='myexperiment')¶
Bases: aethos.modelling.model.ModelBase, aethos.analysis.Analysis, aethos.cleaning.clean.Clean, aethos.preprocessing.preprocess.Preprocess, aethos.feature_engineering.feature.Feature, aethos.visualizations.visualizations.Visualizations, aethos.stats.stats.Stats

ADABoostClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='ada_cls', run=True, verbose=1, **kwargs)¶ Trains an AdaBoost classification model.
An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
For more AdaBoost info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKFold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Cross-validation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “ada_cls”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 base_estimator (object, optional (default=None)) – The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, then the base estimator is DecisionTreeClassifier(max_depth=1)
 n_estimators (integer, optional (default=50)) – The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.
 learning_rate (float, optional (default=1.)) – Learning rate shrinks the contribution of each classifier by learning_rate. There is a tradeoff between learning_rate and n_estimators.
Returns: ClassificationModelAnalysis object to view results and analyze results
Return type: Examples
>>> model.AdaBoostClassification()
>>> model.AdaBoostClassification(model_name='rc_1', learning_rate=0.001)
>>> model.AdaBoostClassification(cv_type='kfold')
>>> model.AdaBoostClassification(gridsearch={'n_estimators': [50, 100]}, cv_type='stratkfold')
>>> model.AdaBoostClassification(run=False) # Add model to the queue
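For reference, a gridsearch call like the examples above presumably corresponds to something like the following plain scikit-learn code. The mapping is an assumption; aethos may configure the search differently:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=100, random_state=0)

# 'stratkfold' maps to StratifiedKFold; score='accuracy' maps to scoring.
search = GridSearchCV(
    AdaBoostClassifier(),
    param_grid={"n_estimators": [50, 100]},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3),
)
search.fit(X, y)
```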

BaggingClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='bag_cls', run=True, verbose=1, **kwargs)¶ Trains a Bagging classification model.
A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.
For more Bagging Classifier info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKFold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Cross-validation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “bag_cls”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 base_estimator (object or None, optional (default=None)) – The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.
 n_estimators (int, optional (default=10)) – The number of base estimators in the ensemble.
 max_samples (int or float, optional (default=1.0)) – The number of samples to draw from X to train each base estimator. If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples.
 max_features (int or float, optional (default=1.0)) – The number of features to draw from X to train each base estimator. If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.
 bootstrap (boolean, optional (default=True)) – Whether samples are drawn with replacement. If False, sampling without replacement is performed.
 bootstrap_features (boolean, optional (default=False)) – Whether features are drawn with replacement.
 oob_score (bool, optional (default=False)) – Whether to use outofbag samples to estimate the generalization error.
Returns: ClassificationModelAnalysis object to view results and analyze results
Return type: Examples
>>> model.BaggingClassification()
>>> model.BaggingClassification(model_name='m1', n_estimators=100)
>>> model.BaggingClassification(cv_type='kfold')
>>> model.BaggingClassification(gridsearch={'n_estimators':[100, 200]}, cv_type='stratkfold')
>>> model.BaggingClassification(run=False) # Add model to the queue

BernoulliClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='bern', run=True, verbose=1, **kwargs)¶ Trains a Bernoulli Naive Bayes classification model.
Like MultinomialNB, this classifier is suitable for discrete data. The difference is that while MultinomialNB works with occurrence counts, BernoulliNB is designed for binary/boolean features.
For more Bernoulli Naive Bayes info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB and https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKFold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Cross-validation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “bern”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 alpha (float, optional (default=1.0)) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
 binarize (float or None, optional (default=0.0)) – Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
 fit_prior (boolean, optional (default=True)) – Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
 class_prior (arraylike, size=[n_classes,], optional (default=None)) – Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
Returns: ClassificationModelAnalysis object to view results and analyze results
Return type: Examples
>>> model.BernoulliClassification()
>>> model.BernoulliClassification(model_name='m1', binarize=0.5)
>>> model.BernoulliClassification(cv_type='kfold')
>>> model.BernoulliClassification(gridsearch={'fit_prior':[True, False]}, cv_type='stratkfold')
>>> model.BernoulliClassification(run=False) # Add model to the queue

DecisionTreeClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='dt_cls', run=True, verbose=1, **kwargs)¶ Trains a Decision Tree classification model.
For more Decision Tree info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKFold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Cross-validation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “dt_cls”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 criterion (string, optional (default=”gini”)) – The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
 splitter (string, optional (default=”best”)) – The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
 max_depth (int or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
 min_samples_split (int, float, optional (default=2)) – The minimum number of samples required to split an internal node: If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
 min_samples_leaf (int, float, optional (default=1)) – The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
 max_features (int, float, string or None, optional (default=None)) – The number of features to consider when looking for the best split: If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
 max_leaf_nodes (int or None, optional (default=None)) – Grow a tree with max_leaf_nodes in bestfirst fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
 min_impurity_decrease (float, optional (default=0.)) –
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
 min_impurity_split (float, (default=1e-7)) – Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
 class_weight (dict, list of dicts, “balanced” or None, default=None) –
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multioutput problems, a list of dicts can be provided in the same order as the columns of y.
Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for fourclass multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
For multioutput, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
 presort (bool, optional (default=False)) – Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to true may slow down the training process. When using either a smaller dataset or a restricted depth, this may speed up the training.
 ccp_alpha (non-negative float, optional (default=0.0)) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
Returns: ClassificationModelAnalysis object to view results and analyze results
Return type: Examples
>>> model.DecisionTreeClassification()
>>> model.DecisionTreeClassification(model_name='m1', min_impurity_split=0.0003)
>>> model.DecisionTreeClassification(cv_type='kfold')
>>> model.DecisionTreeClassification(gridsearch={'min_impurity_split':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.DecisionTreeClassification(run=False) # Add model to the queue
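The min_impurity_decrease equation in the parameter list above is simple enough to check numerically. This helper just transcribes the equation and is not part of aethos or scikit-learn:

```python
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    """N_t / N * (impurity - N_t_R / N_t * right_impurity
                           - N_t_L / N_t * left_impurity)"""
    return N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)

# A pure split of a perfectly mixed root node decreases impurity by the
# parent impurity itself (weighted by the node's share of samples).
decrease = weighted_impurity_decrease(100, 100, 50, 50, 0.5, 0.0, 0.0)
```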

GaussianClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='gauss', run=True, verbose=1, **kwargs)¶ Trains a Gaussian Naive Bayes classification model.
For more Gaussian Naive Bayes info, you can view it here: https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKFold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Cross-validation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “gauss”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 priors (arraylike, shape (n_classes,)) – Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
 var_smoothing (float, optional (default=1e-9)) – Portion of the largest variance of all features that is added to variances for calculation stability.
Returns: ClassificationModelAnalysis object to view results and analyze results
Return type: Examples
>>> model.GaussianClassification()
>>> model.GaussianClassification(model_name='m1', var_smoothing=0.0003)
>>> model.GaussianClassification(cv_type='kfold')
>>> model.GaussianClassification(gridsearch={'var_smoothing':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.GaussianClassification(run=False) # Add model to the queue

GradientBoostingClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='grad_cls', run=True, verbose=1, **kwargs)¶ Trains a Gradient Boosting classification model.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the binomial or multinomial deviance loss function. Binary classification is a special case where only a single regression tree is induced.
For more Gradient Boosting Classifier info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKFold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Cross-validation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “grad_cls”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 loss ({‘deviance’, ‘exponential’}, optional (default=’deviance’)) – loss function to be optimized. ‘deviance’ refers to deviance (= logistic regression) for classification with probabilistic outputs. For loss ‘exponential’ gradient boosting recovers the AdaBoost algorithm.
 learning_rate (float, optional (default=0.1)) – learning rate shrinks the contribution of each tree by learning_rate. There is a tradeoff between learning_rate and n_estimators.
 n_estimators (int (default=100)) – The number of boosting stages to perform. Gradient boosting is fairly robust to overfitting so a large number usually results in better performance.
 subsample (float, optional (default=1.0)) – The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. Subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
 criterion (string, optional (default=”friedman_mse”)) – The function to measure the quality of a split. Supported criteria are “friedman_mse” for the mean squared error with improvement score by Friedman, “mse” for mean squared error, and “mae” for the mean absolute error. The default value of “friedman_mse” is generally the best as it can provide a better approximation in some cases.
 min_samples_split (int, float, optional (default=2)) – The minimum number of samples required to split an internal node: If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
 min_samples_leaf (int, float, optional (default=1)) – The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
 max_depth (integer, optional (default=3)) – maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
 max_features (int, float, string or None, optional (default=None)) –
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features. Choosing max_features < n_features leads to a reduction of variance and an increase in bias.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
 max_leaf_nodes (int or None, optional (default=None)) – Grow trees with max_leaf_nodes in bestfirst fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
 presort (bool or ‘auto’, optional (default=’auto’)) – Whether to presort the data to speed up the finding of best splits in fitting. Auto mode by default will use presorting on dense data and default to normal sorting on sparse data. Setting presort to true on sparse data will raise an error.
 validation_fraction (float, optional, default 0.1) – The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if n_iter_no_change is set to an integer.
 tol (float, optional, default 1e-4) – Tolerance for the early stopping. When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.
Returns: ClassificationModelAnalysis object to view results and analyze results
Return type: Examples
>>> model.GradientBoostingClassification()
>>> model.GradientBoostingClassification(model_name='m1', n_estimators=100)
>>> model.GradientBoostingClassification(cv_type='kfold')
>>> model.GradientBoostingClassification(gridsearch={'n_estimators':[100, 200]}, cv_type='stratkfold')
>>> model.GradientBoostingClassification(run=False) # Add model to the queue

LightGBMClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='lgbm_cls', run=True, verbose=1, **kwargs)¶ Trains a LightGBM Classification Model.
LightGBM is a gradient boosting framework that uses a tree based learning algorithm.
LightGBM grows trees vertically while other algorithms grow trees horizontally, meaning that LightGBM grows trees leaf-wise while other algorithms grow level-wise. It will choose the leaf with max delta loss to grow. When growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.
For more LightGBM info, you can view it here: https://github.com/microsoft/LightGBM and https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Cross-validation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “lgbm_cls”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 boosting_type (string, optional (default='gbdt')) – ‘gbdt’, traditional Gradient Boosting Decision Tree. ‘dart’, Dropouts meet Multiple Additive Regression Trees. ‘goss’, Gradient-based One-Side Sampling. ‘rf’, Random Forest.
 num_leaves (int, optional (default=31)) – Maximum tree leaves for base learners.
 max_depth (int, optional (default=-1)) – Maximum tree depth for base learners, <=0 means no limit.
 learning_rate (float, optional (default=0.1)) – Boosting learning rate. You can use the callbacks parameter of the fit method to shrink/adapt the learning rate in training using the reset_parameter callback. Note that this will ignore the learning_rate argument in training.
 n_estimators (int, optional (default=100)) – Number of boosted trees to fit.
 subsample_for_bin (int, optional (default=200000)) – Number of samples for constructing bins.
 objective (string, callable or None, optional (default=None)) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). Default: ‘regression’ for LGBMRegressor, ‘binary’ or ‘multiclass’ for LGBMClassifier, ‘lambdarank’ for LGBMRanker.
 class_weight (dict, 'balanced' or None, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. Use this parameter only for multiclass classification tasks; for binary classification tasks you may use the is_unbalance or scale_pos_weight parameters. Note that the usage of all these parameters will result in poor estimates of the individual class probabilities. You may want to consider performing probability calibration (https://scikit-learn.org/stable/modules/calibration.html) of your model. The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). If None, all classes are supposed to have weight one. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
 min_split_gain (float, optional (default=0.)) – Minimum loss reduction required to make a further partition on a leaf node of the tree.
 min_child_weight (float, optional (default=1e-3)) – Minimum sum of instance weight (hessian) needed in a child (leaf).
 min_child_samples (int, optional (default=20)) – Minimum number of data needed in a child (leaf).
 subsample (float, optional (default=1.)) – Subsample ratio of the training instance.
 subsample_freq (int, optional (default=0)) – Frequency of subsample; <=0 means no subsampling.
 colsample_bytree (float, optional (default=1.)) – Subsample ratio of columns when constructing each tree.
 reg_alpha (float, optional (default=0.)) – L1 regularization term on weights.
 reg_lambda (float, optional (default=0.)) – L2 regularization term on weights.
 random_state (int or None, optional (default=None)) – Random number seed. If None, default seeds in C++ code will be used.
 n_jobs (int, optional (default=-1)) – Number of parallel threads.
 silent (bool, optional (default=True)) – Whether to print messages while running boosting.
 importance_type (string, optional (default='split')) – The type of feature importance to be filled into feature_importances_. If ‘split’, result contains numbers of times the feature is used in a model. If ‘gain’, result contains total gains of splits which use the feature.
Returns: ClassificationModelAnalysis object to view and analyze results
Return type: ClassificationModelAnalysis
Examples
>>> model.LightGBMClassification()
>>> model.LightGBMClassification(model_name='m1', reg_alpha=0.0003)
>>> model.LightGBMClassification(cv_type='kfold')
>>> model.LightGBMClassification(gridsearch={'reg_alpha':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.LightGBMClassification(run=False) # Add model to the queue

LinearSVC
(cv_type=None, gridsearch=None, score='accuracy', model_name='linsvc', run=True, verbose=1, **kwargs)¶ Trains a Linear Support Vector classification model.
Supports multiclass classification.
Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples. This class supports both dense and sparse input, and the multiclass support is handled according to a one-vs-the-rest scheme.
For more Support Vector info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Cross-validation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “linsvc”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 penalty (string, ‘l1’ or ‘l2’ (default=’l2’)) – Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ leads to coef_ vectors that are sparse.
 loss (string, ‘hinge’ or ‘squared_hinge’ (default=’squared_hinge’)) – Specifies the loss function. ‘hinge’ is the standard SVM loss (used e.g. by the SVC class) while ‘squared_hinge’ is the square of the hinge loss.
 dual (bool, (default=True)) – Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.
 tol (float, optional (default=1e-4)) – Tolerance for stopping criteria.
 C (float, optional (default=1.0)) – Penalty parameter C of the error term.
 multi_class (string, ‘ovr’ or ‘crammer_singer’ (default=’ovr’)) – Determines the multiclass strategy if y contains more than two classes. “ovr” trains n_classes one-vs-rest classifiers, while “crammer_singer” optimizes a joint objective over all classes. While crammer_singer is interesting from a theoretical perspective as it is consistent, it is seldom used in practice as it rarely leads to better accuracy and is more expensive to compute. If “crammer_singer” is chosen, the options loss, penalty and dual will be ignored.
 fit_intercept (boolean, optional (default=True)) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).
 intercept_scaling (float, optional (default=1)) – When self.fit_intercept is True, instance vector x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight. Note: the synthetic feature weight is subject to l1/l2 regularization as all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), intercept_scaling has to be increased.
 class_weight ({dict, ‘balanced’}, optional) – Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
 max_iter (int, (default=1000)) – The maximum number of iterations to be run.
Returns: ClassificationModelAnalysis object to view and analyze results
Return type: ClassificationModelAnalysis
Examples
>>> model.LinearSVC()
>>> model.LinearSVC(model_name='m1', C=0.0003)
>>> model.LinearSVC(cv_type='kfold')
>>> model.LinearSVC(gridsearch={'C':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.LinearSVC(run=False) # Add model to the queue

LogisticRegression
(cv_type=None, gridsearch=None, score='accuracy', model_name='log_reg', run=True, verbose=1, **kwargs)¶ Trains a logistic regression model.
For more Logistic Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
 If running grid search, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “log_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 penalty (str, ‘l1’, ‘l2’, ‘elasticnet’ or ‘none’, optional (default=’l2’)) – Used to specify the norm used in the penalization. The ‘newton-cg’, ‘sag’ and ‘lbfgs’ solvers support only l2 penalties. ‘elasticnet’ is only supported by the ‘saga’ solver. If ‘none’ (not supported by the liblinear solver), no regularization is applied.
 tol (float, optional (default=1e-4)) – Tolerance for stopping criteria.
 C (float, optional (default=1.0)) – Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
 class_weight (dict or ‘balanced’, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
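The “balanced” weighting rule above can be sketched in plain Python (a hypothetical helper for illustration, not part of aethos or scikit-learn):

```python
from collections import Counter

def balanced_class_weights(y):
    """Compute 'balanced' class weights as described above:
    weight(c) = n_samples / (n_classes * count(c))."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

# A 3:1 imbalanced binary target up-weights the rare class:
weights = balanced_class_weights([0, 0, 0, 1])
# weights[1] == 2.0 (rare class), weights[0] == 2/3 (common class)
```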
Returns: ClassificationModelAnalysis object to view and analyze results
Return type: ClassificationModelAnalysis
Examples
>>> model.LogisticRegression()
>>> model.LogisticRegression(model_name='lg_1', C=0.001)
>>> model.LogisticRegression(cv_type='kfold')
>>> model.LogisticRegression(gridsearch={'C':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.LogisticRegression(run=False) # Add model to the queue

MultinomialClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='multi', run=True, verbose=1, **kwargs)¶ Trains a Multinomial Naive Bayes classification model.
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
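To make the count-based model concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing (the alpha parameter below). This is illustrative only; aethos delegates to scikit-learn's MultinomialNB:

```python
import math
from collections import defaultdict

def train_multinomial_nb(X, y, alpha=1.0):
    """Fit log-priors and alpha-smoothed log-likelihoods.
    X: list of per-sample feature-count lists, y: list of class labels."""
    n_features = len(X[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for counts, label in zip(X, y):
        class_counts[label] += 1
        for j, c in enumerate(counts):
            feature_counts[label][j] += c
    n_samples = len(y)
    model = {}
    for label in class_counts:
        log_prior = math.log(class_counts[label] / n_samples)
        # Laplace/Lidstone smoothing: add alpha to every feature count.
        total = sum(feature_counts[label]) + alpha * n_features
        log_like = [math.log((c + alpha) / total) for c in feature_counts[label]]
        model[label] = (log_prior, log_like)
    return model

def predict_multinomial_nb(model, counts):
    """Pick the class with the highest posterior log-probability."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(c * ll for c, ll in zip(counts, log_like))
    return max(model, key=score)
```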
For more Multinomial Naive Bayes info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB and https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Cross-validation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “multi”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 alpha (float, optional (default=1.0)) – Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
 fit_prior (boolean, optional (default=True)) – Whether to learn class prior probabilities or not. If false, a uniform prior will be used.
 class_prior (arraylike, size (n_classes,), optional (default=None)) – Prior probabilities of the classes. If specified the priors are not adjusted according to the data.
Returns: ClassificationModelAnalysis object to view and analyze results
Return type: ClassificationModelAnalysis
Examples
>>> model.MultinomialClassification()
>>> model.MultinomialClassification(model_name='m1', alpha=0.0003)
>>> model.MultinomialClassification(cv_type='kfold')
>>> model.MultinomialClassification(gridsearch={'alpha':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.MultinomialClassification(run=False) # Add model to the queue

RandomForestClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='rf_cls', run=True, verbose=1, **kwargs)¶ Trains a Random Forest classification model.
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
For more Random Forest info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Cross-validation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “rf_cls”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 n_estimators (integer, optional (default=10)) – The number of trees in the forest.
 criterion (string, optional (default=”gini”)) –
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.
Note: this parameter is treespecific.
 max_depth (integer or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
 min_samples_split (int, float, optional (default=2)) –
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
 min_samples_leaf (int, float, optional (default=1)) –
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
 max_features (int, float, string or None, optional (default=”auto”)) –
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features) (same as “auto”). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
 max_leaf_nodes (int or None, optional (default=None)) – Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then there is an unlimited number of leaf nodes.
 min_impurity_decrease (float, optional (default=0.)) –
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
 bootstrap (boolean, optional (default=True)) – Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
 oob_score (bool (default=False)) – Whether to use outofbag samples to estimate the generalization accuracy.
 class_weight (dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None)) –
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multioutput problems, a list of dicts can be provided in the same order as the columns of y. Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for fourclass multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}]. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)) The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown. For multioutput, the weights of each column of y will be multiplied.
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
 ccp_alpha (non-negative float, optional (default=0.0)) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
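The weighted impurity decrease equation used by min_impurity_decrease above can be evaluated directly in plain Python (a hypothetical helper for illustration, not part of aethos or scikit-learn):

```python
def weighted_impurity_decrease(n, n_t, n_t_l, n_t_r,
                               impurity, left_impurity, right_impurity):
    """Evaluate N_t / N * (impurity - N_t_R / N_t * right_impurity
                                    - N_t_L / N_t * left_impurity)."""
    return (n_t / n) * (impurity
                        - (n_t_r / n_t) * right_impurity
                        - (n_t_l / n_t) * left_impurity)

# A perfect split of a 50/50 node (Gini 0.5) into two pure children,
# evaluated at the root over all 100 samples, yields the full 0.5 decrease:
# weighted_impurity_decrease(100, 100, 50, 50, 0.5, 0.0, 0.0) -> 0.5
```

A node is only split when this quantity meets or exceeds min_impurity_decrease.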
Returns: ClassificationModelAnalysis object to view and analyze results
Return type: ClassificationModelAnalysis
Examples
>>> model.RandomForestClassification()
>>> model.RandomForestClassification(model_name='m1', n_estimators=100)
>>> model.RandomForestClassification(cv_type='kfold')
>>> model.RandomForestClassification(gridsearch={'n_estimators':[100, 200]}, cv_type='stratkfold')
>>> model.RandomForestClassification(run=False) # Add model to the queue

RidgeClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='ridge_cls', run=True, verbose=1, **kwargs)¶ Trains a Ridge Classification model.
For more Ridge Regression parameters, you can view them here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “ridge_cls”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 alpha (float) – Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to C^-1 in other linear models such as LogisticRegression or LinearSVC.
 fit_intercept (boolean) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. data is expected to be already centered).
 normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
 tol (float, optional (default=1e-4)) – Tolerance for stopping criteria.
 class_weight (dict or ‘balanced’, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
Returns: ClassificationModelAnalysis object to view and analyze results
Return type: ClassificationModelAnalysis
Examples
>>> model.RidgeClassification()
>>> model.RidgeClassification(model_name='rc_1', tol=0.001)
>>> model.RidgeClassification(cv_type='kfold')
>>> model.RidgeClassification(gridsearch={'alpha':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.RidgeClassification(run=False) # Add model to the queue

SGDClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='sgd_cls', run=True, verbose=1, **kwargs)¶ Trains a Linear classifier (SVM, logistic regression, a.o.) with SGD training.
For more info please view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Crossvalidation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “sgd_cls”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 loss (str, default: ‘hinge’) –
The loss function to be used. Defaults to ‘hinge’, which gives a linear SVM. The possible options are ‘hinge’, ‘log’, ‘modified_huber’, ‘squared_hinge’, ‘perceptron’, or a regression loss: ‘squared_loss’, ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’.
The ‘log’ loss gives logistic regression, a probabilistic classifier. ‘modified_huber’ is another smooth loss that brings tolerance to outliers as well as probability estimates. ‘squared_hinge’ is like hinge but is quadratically penalized. ‘perceptron’ is the linear loss used by the perceptron algorithm. The other losses are designed for regression but can be useful in classification as well; see SGDRegressor for a description.
 penalty (str, ‘none’, ‘l2’, ‘l1’, or ‘elasticnet’) – The penalty (aka regularization term) to be used. Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’.
 alpha (float) – Constant that multiplies the regularization term. Defaults to 0.0001 Also used to compute learning_rate when set to ‘optimal’.
 l1_ratio (float) – The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.
 fit_intercept (bool) – Whether the intercept should be estimated or not. If False, the data is assumed to be already centered. Defaults to True.
 max_iter (int, optional (default=1000)) – The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit.
 tol (float or None, optional (default=1e-3)) – The stopping criterion. If it is not None, the iterations will stop when (loss > best_loss - tol) for n_iter_no_change consecutive epochs.
 shuffle (bool, optional) – Whether or not the training data should be shuffled after each epoch. Defaults to True.
 epsilon (float) – Epsilon in the epsilon-insensitive loss functions; only if loss is ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’. For ‘huber’, determines the threshold at which it becomes less important to get the prediction exactly right. For epsilon-insensitive, any differences between the current prediction and the correct label are ignored if they are less than this threshold.
 learning_rate (string, optional) –
The learning rate schedule:
‘constant’: eta = eta0
‘optimal’ [default]: eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.
‘invscaling’: eta = eta0 / pow(t, power_t)
‘adaptive’: eta = eta0, as long as the training loss keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.
 eta0 (double) – The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’ schedules. The default value is 0.0 as eta0 is not used by the default schedule ‘optimal’.
 power_t (double) – The exponent for inverse scaling learning rate [default 0.5].
 early_stopping (bool, default=False) – Whether to use early stopping to terminate training when validation score is not improving. If set to True, it will automatically set aside a stratified fraction of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.
 validation_fraction (float, default=0.1) – The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.
 n_iter_no_change (int, default=5) – Number of iterations with no improvement to wait before early stopping.
 class_weight (dict, {class_label: weight} or “balanced” or None, optional) –
Preset for the class_weight fit parameter.
Weights associated with classes. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
 average (bool or int, optional) – When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples.
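The penalty mixing and learning-rate schedules described above can be sketched in plain Python (hypothetical helpers for illustration only; scikit-learn's internal scaling of the L2 term may differ, t0 is normally chosen by a heuristic, and the stateful ‘adaptive’ schedule is omitted):

```python
def elastic_net_penalty(weights, alpha=0.0001, l1_ratio=0.15):
    """Regularization term added to the loss, per the l1_ratio rule above:
    alpha * (l1_ratio * ||w||_1 + (1 - l1_ratio) * ||w||_2^2).
    l1_ratio=0 is pure L2, l1_ratio=1 is pure L1."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return alpha * (l1_ratio * l1 + (1 - l1_ratio) * l2)

def sgd_learning_rate(schedule, t, eta0=0.01, alpha=0.0001,
                      power_t=0.5, t0=1.0):
    """Learning rate eta at step t for the non-adaptive schedules above."""
    if schedule == 'constant':
        return eta0
    if schedule == 'optimal':
        return 1.0 / (alpha * (t + t0))
    if schedule == 'invscaling':
        return eta0 / (t ** power_t)
    raise ValueError(schedule)

# 'invscaling' decays as 1/sqrt(t) with the default power_t=0.5:
# sgd_learning_rate('invscaling', 4, eta0=0.1) -> 0.05
```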
Returns: ClassificationModelAnalysis object to view and analyze results
Return type: ClassificationModelAnalysis
Examples
>>> model.SGDClassification()
>>> model.SGDClassification(model_name='rc_1', tol=0.001)
>>> model.SGDClassification(cv_type='kfold')
>>> model.SGDClassification(gridsearch={'alpha':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.SGDClassification(run=False) # Add model to the queue

SVC
(cv_type=None, gridsearch=None, score='accuracy', model_name='svc_cls', run=True, verbose=1, **kwargs)¶ Trains a C-Support Vector classification model.
Supports multiclass classification.
The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using model.LinearSVC or model.SGDClassification instead.
The multiclass support is handled according to a one-vs-one scheme.
For more Support Vector info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type ({kfold, stratkfold}, Cross-validation Generator, optional) – Cross validation method, by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “svc_cls”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 C (float, optional (default=1.0)) – Penalty parameter C of the error term.
 kernel (string, optional (default=’rbf’)) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
 degree (int, optional (default=3)) – Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
 gamma (float, optional (default=’auto’)) – Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. Current default is ‘auto’ which uses 1 / n_features, if gamma=’scale’ is passed then it uses 1 / (n_features * X.var()) as value of gamma.
 coef0 (float, optional (default=0.0)) – Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
 shrinking (boolean, optional (default=True)) – Whether to use the shrinking heuristic.
 probability (boolean, optional (default=False)) – Whether to enable probability estimates. This must be enabled prior to calling fit, and will slow down that method.
 tol (float, optional (default=1e-3)) – Tolerance for stopping criterion.
 cache_size (float, optional) – Specify the size of the kernel cache (in MB).
 class_weight ({dict, ‘balanced’}, optional) – Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
 max_iter (int, optional (default=-1)) – Hard limit on iterations within solver, or -1 for no limit.
 decision_function_shape (‘ovo’, ‘ovr’, default=’ovr’) – Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one (‘ovo’) decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2). However, one-vs-one (‘ovo’) is always used as the multi-class strategy.
Returns: ClassificationModelAnalysis object to view results and analyze results
Return type: ClassificationModelAnalysis
Examples
>>> model.SVC()
>>> model.SVC(model_name='m1', C=0.0003)
>>> model.SVC(cv_type='kfold')
>>> model.SVC(gridsearch={'C':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.SVC(run=False) # Add model to the queue
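Since keyword arguments are forwarded to sklearn.svm.SVC, the effect of decision_function_shape on the output shape can be sketched with scikit-learn directly (the data below is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 4-class problem: 40 samples, 3 features, labels 0..3
rng = np.random.RandomState(0)
X = rng.randn(40, 3)
y = np.arange(40) % 4

ovr = SVC(decision_function_shape='ovr').fit(X, y)
ovo = SVC(decision_function_shape='ovo').fit(X, y)

print(ovr.decision_function(X).shape)  # (40, 4): one column per class
print(ovo.decision_function(X).shape)  # (40, 6): n_classes * (n_classes - 1) / 2 columns
```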

XGBoostClassification
(cv_type=None, gridsearch=None, score='accuracy', model_name='xgb_cls', run=True, verbose=1, **kwargs)¶ Trains an XGBoost Classification Model.
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
For more XGBoost info, you can view it here: https://xgboost.readthedocs.io/en/latest/ and https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst.
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
Parameters:  cv_type (str, optional) – Cross-validation method: 'kfold' or 'stratkfold', by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘accuracy’
 model_name (str, optional) – Name for this model, by default “xgb_cls”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 max_depth (int) – Maximum tree depth for base learners. By default 3
 learning_rate (float) – Boosting learning rate (xgb’s “eta”). By default 0.1
 n_estimators (int) – Number of trees to fit. By default 100.
 objective (string or callable) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). By default binary:logistic for binary classification or multi:softprob for multi-class classification
 booster (string) – Specify which booster to use: gbtree, gblinear or dart. By default ‘gbtree’
 tree_method (string) – Specify which tree method to use If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from parameters document. By default ‘auto’
 gamma (float) – Minimum loss reduction required to make a further partition on a leaf node of the tree. By default 0
 subsample (float) – Subsample ratio of the training instance. By default 1
 reg_alpha (float (xgb's alpha)) – L1 regularization term on weights. By default 0
 reg_lambda (float (xgb's lambda)) – L2 regularization term on weights. By default 1
 scale_pos_weight (float) – Balancing of positive and negative weights. By default 1
 base_score – The initial prediction score of all instances, global bias. By default 0
 missing (float, optional) – Value in the data which needs to be present as a missing value. If None, defaults to np.nan. By default, None
 num_parallel_tree (int) – Used for boosting random forest. By default 1
 importance_type (string, default "gain") – The feature importance type for the feature_importances_ property: either “gain”, “weight”, “cover”, “total_gain” or “total_cover”. By default ‘gain’.
Note
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:
 y_true: array_like of shape [n_samples] – The target values
 y_pred: array_like of shape [n_samples] – The predicted values
 grad: array_like of shape [n_samples] – The value of the gradient for each sample point.
 hess: array_like of shape [n_samples] – The value of the second derivative for each sample point.
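A minimal numpy-only sketch of a callable matching that signature, here computing the gradient and second derivative of the binary log loss (the function name and sample values are illustrative, not part of aethos or xgboost):

```python
import numpy as np

def logistic_obj(y_true, y_pred):
    """Custom binary-logistic objective in the signature described above:
    y_pred holds raw (pre-sigmoid) scores; returns per-sample grad and hess."""
    p = 1.0 / (1.0 + np.exp(-y_pred))  # predicted probability
    grad = p - y_true                  # first derivative of the log loss
    hess = p * (1.0 - p)               # second derivative of the log loss
    return grad, hess

# At raw score 0 the predicted probability is 0.5 for both samples
grad, hess = logistic_obj(np.array([0.0, 1.0]), np.array([0.0, 0.0]))
print(grad)  # [ 0.5 -0.5]
print(hess)  # [0.25 0.25]
```

Such a callable could then be passed as objective=logistic_obj.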
Returns: ClassificationModelAnalysis object to view results and analyze results
Return type: ClassificationModelAnalysis
Examples
>>> model.XGBoostClassification()
>>> model.XGBoostClassification(model_name='m1', reg_alpha=0.0003)
>>> model.XGBoostClassification(cv_type='kfold')
>>> model.XGBoostClassification(gridsearch={'reg_alpha':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.XGBoostClassification(run=False) # Add model to the queue


class
aethos.modelling.regression_models.
Regression
(x_train, target, x_test=None, test_split_percentage=0.2, exp_name='myexperiment')¶ Bases:
aethos.modelling.model.ModelBase
,aethos.analysis.Analysis
,aethos.cleaning.clean.Clean
,aethos.preprocessing.preprocess.Preprocess
,aethos.feature_engineering.feature.Feature
,aethos.visualizations.visualizations.Visualizations
,aethos.stats.stats.Stats

ADABoostRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='ada_reg', run=True, verbose=1, **kwargs)¶ Trains an AdaBoost Regression model.
An AdaBoost regressor is a meta-estimator that begins by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset, but where the weights of instances are adjusted according to the error of the current prediction, such that subsequent regressors focus more on difficult cases.
For more AdaBoost info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html#sklearn.ensemble.AdaBoostRegressor
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv_type (str, optional) – Cross-validation method: 'kfold' or 'stratkfold', by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “ada_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 base_estimator (object, optional (default=None)) – The base estimator from which the boosted ensemble is built. Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, then the base estimator is DecisionTreeRegressor(max_depth=3)
 n_estimators (integer, optional (default=50)) – The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.
 learning_rate (float, optional (default=1.)) – Learning rate shrinks the contribution of each regressor by learning_rate. There is a trade-off between learning_rate and n_estimators.
 loss ({‘linear’, ‘square’, ‘exponential’}, optional (default=’linear’)) – The loss function to use when updating the weights after each boosting iteration.
Returns: RegressionModelAnalysis object to view results and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.AdaBoostRegression()
>>> model.AdaBoostRegression(model_name='m1', learning_rate=0.0003)
>>> model.AdaBoostRegression(cv_type='kfold')
>>> model.AdaBoostRegression(gridsearch={'learning_rate':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.AdaBoostRegression(run=False) # Add model to the queue
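As a sketch of what is trained under the hood, here is sklearn.ensemble.AdaBoostRegressor (the estimator wrapped here) used directly on a toy dataset; the data and settings are illustrative:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Toy 1-D regression problem: noisy sine wave
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, (80, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

reg = AdaBoostRegressor(n_estimators=50, learning_rate=1.0, loss='linear',
                        random_state=0).fit(X, y)

# One fitted copy of the base regressor per boosting stage;
# later stages focus on the samples the earlier ones got wrong
print(len(reg.estimators_))        # number of fitted stages (at most n_estimators)
print(reg.estimator_weights_[:3])  # per-stage weights used in the final prediction
```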

BaggingRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='bag_reg', run=True, verbose=1, **kwargs)¶ Trains a Bagging Regressor model.
A Bagging regressor is an ensemble meta-estimator that fits base regressors each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.
For more Bagging Regressor info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv_type (str, optional) – Cross-validation method: 'kfold' or 'stratkfold', by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “bag_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 base_estimator (object or None, optional (default=None)) – The base estimator to fit on random subsets of the dataset. If None, then the base estimator is a decision tree.
 n_estimators (int, optional (default=10)) – The number of base estimators in the ensemble.
 max_samples (int or float, optional (default=1.0)) – The number of samples to draw from X to train each base estimator. If int, then draw max_samples samples. If float, then draw max_samples * X.shape[0] samples.
 max_features (int or float, optional (default=1.0)) – The number of features to draw from X to train each base estimator. If int, then draw max_features features. If float, then draw max_features * X.shape[1] features.
 bootstrap (boolean, optional (default=True)) – Whether samples are drawn with replacement. If False, sampling without replacement is performed.
 bootstrap_features (boolean, optional (default=False)) – Whether features are drawn with replacement.
 oob_score (bool, optional (default=False)) – Whether to use out-of-bag samples to estimate the generalization error.
Returns: RegressionModelAnalysis object to view results and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.BaggingRegression()
>>> model.BaggingRegression(model_name='m1', n_estimators=100)
>>> model.BaggingRegression(cv_type='kfold')
>>> model.BaggingRegression(gridsearch={'n_estimators':[100, 200]}, cv_type='stratkfold')
>>> model.BaggingRegression(run=False) # Add model to the queue
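The int-vs-float semantics of max_samples and max_features described above can be sketched as follows (resolve is a hypothetical helper for illustration, not an aethos or scikit-learn function):

```python
import numpy as np

n_samples, n_features = 200, 12

def resolve(value, total):
    """Hypothetical helper: int means 'use as-is', float means 'fraction of total',
    mirroring how max_samples / max_features are interpreted above."""
    return value if isinstance(value, int) else int(value * total)

print(resolve(50, n_samples))     # 50  (int: used directly)
print(resolve(0.5, n_samples))    # 100 (float: 0.5 * X.shape[0])
print(resolve(0.25, n_features))  # 3   (float: 0.25 * X.shape[1])
```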

BayesianRidgeRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='bayridge_reg', run=True, verbose=1, **kwargs)¶ Trains a Bayesian Ridge Regression model.
For more Linear Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge and https://scikit-learn.org/stable/modules/linear_model.html#bayesian-regression
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv_type (str, optional) – Cross-validation method: 'kfold' or 'stratkfold', by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “bayridge_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 n_iter (int, optional) – Maximum number of iterations. Default is 300. Should be greater than or equal to 1.
 tol (float, optional) – Stop the algorithm if w has converged. Default is 1e-3.
 alpha_1 (float, optional) – Hyperparameter: shape parameter for the Gamma distribution prior over the alpha parameter. Default is 1e-6.
 alpha_2 (float, optional) – Hyperparameter: inverse scale parameter (rate parameter) for the Gamma distribution prior over the alpha parameter. Default is 1e-6.
 lambda_1 (float, optional) – Hyperparameter: shape parameter for the Gamma distribution prior over the lambda parameter. Default is 1e-6.
 lambda_2 (float, optional) – Hyperparameter: inverse scale parameter (rate parameter) for the Gamma distribution prior over the lambda parameter. Default is 1e-6.
 fit_intercept (boolean, optional, default True) – Whether to calculate the intercept for this model. The intercept is not treated as a probabilistic parameter and thus has no associated variance. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
 normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
Returns: RegressionModelAnalysis object to view results and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.BayesianRidgeRegression()
>>> model.BayesianRidgeRegression(model_name='m1', alpha_1=0.0003)
>>> model.BayesianRidgeRegression(cv_type='kfold')
>>> model.BayesianRidgeRegression(gridsearch={'alpha_2':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.BayesianRidgeRegression(run=False) # Add model to the queue

DecisionTreeRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='dt_reg', run=True, verbose=1, **kwargs)¶ Trains a Decision Tree Regression model.
For more Decision Tree info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv_type (str, optional) – Cross-validation method: 'kfold' or 'stratkfold', by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “dt_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 criterion (string, optional (default=”mse”)) –
The function to measure the quality of a split.
 Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion and minimizes the L2 loss using the mean of each terminal node,
 “friedman_mse”, which uses mean squared error with Friedman’s improvement score for potential splits, and “mae” for the mean absolute error, which minimizes the L1 loss using the median of each terminal node.
 splitter (string, optional (default=”best”)) – The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.
 max_depth (int or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
 min_samples_split (int, float, optional (default=2)) – The minimum number of samples required to split an internal node. If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
 min_samples_leaf (int, float, optional (default=1)) – The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
 max_features (int, float, string or None, optional (default=None)) – The number of features to consider when looking for the best split. If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
 max_leaf_nodes (int or None, optional (default=None)) – Grow a tree with max_leaf_nodes in bestfirst fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
 min_impurity_decrease (float, optional (default=0.)) – A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
 min_impurity_split (float, (default=1e-7)) – Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
 presort (bool, optional (default=False)) – Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to true may slow down the training process. When using either a smaller dataset or a restricted depth, this may speed up the training.
 ccp_alpha (non-negative float, optional (default=0.0)) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
Returns: RegressionModelAnalysis object to view results and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.DecisionTreeRegression()
>>> model.DecisionTreeRegression(model_name='m1', min_impurity_split=0.0003)
>>> model.DecisionTreeRegression(cv_type='kfold')
>>> model.DecisionTreeRegression(gridsearch={'min_impurity_split':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.DecisionTreeRegression(run=False) # Add model to the queue
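The weighted impurity-decrease formula above can be checked numerically; the node statistics below are made up for illustration (unweighted, so plain sample counts serve as the weights):

```python
# Numeric check of the min_impurity_decrease formula quoted above.
N = 100                # total number of samples
N_t = 40               # samples at the current node
N_t_L, N_t_R = 30, 10  # samples in the left and right children
impurity, left_impurity, right_impurity = 0.5, 0.25, 0.1

decrease = N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)

# ~0.115: this candidate split happens only if decrease >= min_impurity_decrease
print(decrease)
```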

ElasticnetRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='elastic', run=True, verbose=1, **kwargs)¶ Elastic Net regression with combined L1 and L2 priors as regularizer.
For more Linear Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv_type (str, optional) – Cross-validation method: 'kfold' or 'stratkfold', by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “elastic”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 alpha (float, optional) – Constant that multiplies the penalty terms. Defaults to 1.0. See the notes for the exact mathematical meaning of this parameter. alpha = 0 is equivalent to an ordinary least squares, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised. Given this, you should use the LinearRegression object.
 l1_ratio (float) – The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
 fit_intercept (bool) – Whether the intercept should be estimated or not. If False, the data is assumed to be already centered.
 normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.
 precompute (True | False | array-like) – Whether to use a precomputed Gram matrix to speed up calculations. The Gram matrix can also be passed as argument. For sparse input this option is always True to preserve sparsity.
 max_iter (int, optional) – The maximum number of iterations
 tol (float, optional) – The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
 positive (bool, optional) – When set to True, forces the coefficients to be positive.
 selection (str, default ‘cyclic’) – If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence, especially when tol is higher than 1e-4.
Returns: RegressionModelAnalysis object to view results and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.ElasticNetRegression()
>>> model.ElasticNetRegression(model_name='m1', alpha=0.0003)
>>> model.ElasticNetRegression(cv_type='kfold')
>>> model.ElasticNetRegression(gridsearch={'alpha':[0.01, 0.02]}, cv_type='stratkfold')
>>> model.ElasticNetRegression(run=False) # Add model to the queue
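The l1_ratio mixing described above corresponds to the penalty term alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2) that scikit-learn adds to the least-squares loss; a small numpy sketch (the helper name is illustrative):

```python
import numpy as np

def elastic_net_penalty(w, alpha=1.0, l1_ratio=0.5):
    """Penalty added to the least-squares loss:
    alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2)."""
    l1 = np.abs(w).sum()    # L1 norm of the coefficients
    l2 = (w ** 2).sum()     # squared L2 norm of the coefficients
    return alpha * (l1_ratio * l1 + 0.5 * (1 - l1_ratio) * l2)

w = np.array([1.0, -2.0])
print(elastic_net_penalty(w, l1_ratio=1.0))  # 3.0 -> pure L1 (lasso-like)
print(elastic_net_penalty(w, l1_ratio=0.0))  # 2.5 -> pure L2 (0.5 * 5.0, ridge-like)
```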

GradientBoostingRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='grad_reg', run=True, verbose=1, **kwargs)¶ Trains a Gradient Boosting regression model.
GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.
For more Gradient Boosting Regressor info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv_type (str, optional) – Cross-validation method: 'kfold' or 'stratkfold', by default None
 gridsearch (dict, optional) – Parameters to gridsearch, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “grad_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 loss ({‘ls’, ‘lad’, ‘huber’, ‘quantile’}, optional (default=’ls’)) – Loss function to be optimized. ‘ls’ refers to least squares regression. ‘lad’ (least absolute deviation) is a highly robust loss function solely based on order information of the input variables. ‘huber’ is a combination of the two. ‘quantile’ allows quantile regression (use alpha to specify the quantile).
 learning_rate (float, optional (default=0.1)) – Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.
 n_estimators (int (default=100)) – The number of boosting stages to perform. Gradient boosting is fairly robust to overfitting so a large number usually results in better performance.
 subsample (float, optional (default=1.0)) – The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. Subsample interacts with the parameter n_estimators. Choosing subsample < 1.0 leads to a reduction of variance and an increase in bias.
 criterion (string, optional (default=”friedman_mse”)) – The function to measure the quality of a split. Supported criteria are “friedman_mse” for the mean squared error with improvement score by Friedman, “mse” for mean squared error, and “mae” for the mean absolute error. The default value of “friedman_mse” is generally the best as it can provide a better approximation in some cases.
 min_samples_split (int, float, optional (default=2)) – The minimum number of samples required to split an internal node. If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
 min_samples_leaf (int, float, optional (default=1)) – The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
 max_depth (integer, optional (default=3)) – Maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.
 max_features (int, float, string or None, optional (default=None)) – The number of features to consider when looking for the best split. If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features. Choosing max_features < n_features leads to a reduction of variance and an increase in bias. Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
 alpha (float (default=0.9)) – The alphaquantile of the huber loss function and the quantile loss function. Only if loss=’huber’ or loss=’quantile’.
 max_leaf_nodes (int or None, optional (default=None)) – Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
 presort (bool or ‘auto’, optional (default=’auto’)) – Whether to presort the data to speed up the finding of best splits in fitting. Auto mode by default will use presorting on dense data and default to normal sorting on sparse data. Setting presort to true on sparse data will raise an error.
 validation_fraction (float, optional, default 0.1) – The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if n_iter_no_change is set to an integer.
 tol (float, optional, default 1e-4) – Tolerance for the early stopping. When the loss is not improving by at least tol for n_iter_no_change iterations (if set to a number), the training stops.
Returns: RegressionModelAnalysis object to view and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.GradientBoostingRegression()
>>> model.GradientBoostingRegression(model_name='m1', alpha=0.0003)
>>> model.GradientBoostingRegression(cv=10)
>>> model.GradientBoostingRegression(gridsearch={'alpha':[0.01, 0.02]}, cv='stratkfold')
>>> model.GradientBoostingRegression(run=False) # Add model to the queue
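The max_features rules listed above reduce to a small amount of arithmetic. As a rough sketch of those documented rules (the function name is hypothetical; this is not aethos or scikit-learn internals):

```python
import math

def resolve_max_features(max_features, n_features):
    """Map a max_features setting to a concrete feature count,
    per the rules documented above (illustrative sketch only)."""
    if max_features is None:
        return n_features                      # use every feature
    if isinstance(max_features, int):
        return max_features                    # explicit count
    if isinstance(max_features, float):
        return int(max_features * n_features)  # fraction of features
    if max_features in ("auto", "sqrt"):
        return int(math.sqrt(n_features))
    if max_features == "log2":
        return int(math.log2(n_features))
    raise ValueError(f"unknown max_features: {max_features!r}")

# With 100 features: "sqrt" -> 10, "log2" -> 6, 0.25 -> 25, None -> 100
```

Values below n_features trade variance for bias, as the note above explains.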

LassoRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='lasso', run=True, verbose=1, **kwargs)¶ Lasso Regression Model trained with L1 prior as regularizer (aka the Lasso)
Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0 (no L2 penalty).
For more Lasso Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv (bool, optional) – If True, run cross-validation on the model, by default None.
 gridsearch (int, Cross-validation Generator, optional) – Cross-validation method, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “lasso”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 alpha (float, optional) – Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to an ordinary least squares fit, solved by the LinearRegression object. For numerical reasons, using alpha = 0 with the Lasso object is not advised; use the LinearRegression object instead.
 fit_intercept (boolean, optional, default True) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
 normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
 precompute (True | False | array-like, default=False) – Whether to use a precomputed Gram matrix to speed up calculations. If set to ‘auto’ let us decide. The Gram matrix can also be passed as argument. For sparse input this option is always True to preserve sparsity.
 max_iter (int, optional) – The maximum number of iterations
 tol (float, optional) –
 The tolerance for the optimization:
 if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
 positive (bool, optional) – When set to True, forces the coefficients to be positive.
 selection (str, default ‘cyclic’) – If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence, especially when tol is higher than 1e-4.
Returns: RegressionModelAnalysis object to view and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.LassoRegression()
>>> model.LassoRegression(model_name='m1', alpha=0.0003)
>>> model.LassoRegression(cv=10)
>>> model.LassoRegression(gridsearch={'alpha':[0.01, 0.02]}, cv='stratkfold')
>>> model.LassoRegression(run=False) # Add model to the queue
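The ‘cyclic’/‘random’ selection parameter above refers to coordinate descent, whose per-coefficient update applies an L1 soft-threshold; this is why Lasso drives small coefficients exactly to zero. A minimal sketch of the operator (illustrative; not the actual solver code):

```python
def soft_threshold(rho, alpha):
    """L1 soft-thresholding: shrink the raw coefficient update toward
    zero by alpha, and snap anything inside [-alpha, alpha] exactly to
    zero (sketch only)."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

# With alpha=1.0: 2.5 -> 1.5, -3.0 -> -2.0, 0.4 -> 0.0 (feature dropped)
```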

LightGBMRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='lgbm_reg', run=True, verbose=1, **kwargs)¶ Trains a LightGBM Regression Model.
LightGBM is a gradient boosting framework that uses a tree based learning algorithm.
LightGBM grows trees vertically while other algorithms grow trees horizontally, meaning that LightGBM grows trees leaf-wise while other algorithms grow level-wise. It will choose the leaf with max delta loss to grow. When growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.
For more LightGBM info, you can view it here: https://github.com/microsoft/LightGBM and https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html#lightgbm.LGBMRegressor
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv (bool, optional) – If True, run cross-validation on the model, by default None.
 gridsearch (int, Cross-validation Generator, optional) – Cross-validation method, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “lgbm_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 boosting_type (string, optional (default='gbdt')) – ‘gbdt’, traditional Gradient Boosting Decision Tree. ‘dart’, Dropouts meet Multiple Additive Regression Trees. ‘goss’, Gradient-based One-Side Sampling. ‘rf’, Random Forest.
 num_leaves (int, optional (default=31)) – Maximum tree leaves for base learners.
 max_depth (int, optional (default=-1)) – Maximum tree depth for base learners, <=0 means no limit.
 learning_rate (float, optional (default=0.1)) – Boosting learning rate. You can use the callbacks parameter of the fit method to shrink/adapt the learning rate in training using the reset_parameter callback. Note that this will ignore the learning_rate argument in training.
 n_estimators (int, optional (default=100)) – Number of boosted trees to fit.
 subsample_for_bin (int, optional (default=200000)) – Number of samples for constructing bins.
 objective (string, callable or None, optional (default=None)) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). Default: ‘regression’ for LGBMRegressor, ‘binary’ or ‘multiclass’ for LGBMClassifier, ‘lambdarank’ for LGBMRanker.
 class_weight (dict, 'balanced' or None, optional (default=None)) – Weights associated with classes in the form {class_label: weight}. Use this parameter only for multi-class classification tasks; for binary classification you may use the is_unbalance or scale_pos_weight parameters. Note that using all of these parameters will result in poor estimates of the individual class probabilities. You may want to consider performing probability calibration (https://scikit-learn.org/stable/modules/calibration.html) of your model. The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). If None, all classes are supposed to have weight one. Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
 min_split_gain (float, optional (default=0.)) – Minimum loss reduction required to make a further partition on a leaf node of the tree.
 min_child_weight (float, optional (default=1e-3)) – Minimum sum of instance weight (hessian) needed in a child (leaf).
 min_child_samples (int, optional (default=20)) – Minimum number of data needed in a child (leaf).
 subsample (float, optional (default=1.)) – Subsample ratio of the training instances.
 subsample_freq (int, optional (default=0)) – Frequency of subsample, <=0 means no enable.
 colsample_bytree (float, optional (default=1.)) – Subsample ratio of columns when constructing each tree.
 reg_alpha (float, optional (default=0.)) – L1 regularization term on weights.
 reg_lambda (float, optional (default=0.)) – L2 regularization term on weights.
 random_state (int or None, optional (default=None)) – Random number seed. If None, default seeds in C++ code will be used.
 n_jobs (int, optional (default=-1)) – Number of parallel threads.
 silent (bool, optional (default=True)) – Whether to print messages while running boosting.
 importance_type (string, optional (default='split')) – The type of feature importance to be filled into feature_importances_. If ‘split’, result contains numbers of times the feature is used in a model. If ‘gain’, result contains total gains of splits which use the feature.
Returns: RegressionModelAnalysis object to view and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.LightGBMRegression()
>>> model.LightGBMRegression(model_name='m1', reg_lambda=0.0003)
>>> model.LightGBMRegression(cv=10)
>>> model.LightGBMRegression(gridsearch={'reg_lambda':[0.01, 0.02]}, cv='stratkfold')
>>> model.LightGBMRegression(run=False) # Add model to the queue
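The leaf-wise growth described above means each step splits the single current leaf whose candidate split gives the largest loss reduction, rather than splitting every leaf at the current depth. A toy illustration of that selection rule (hypothetical names; not LightGBM internals):

```python
def pick_leaf_to_split(leaf_gains):
    """Leaf-wise (best-first) growth: split the leaf whose candidate
    split yields the max delta loss. Level-wise growth would instead
    split all leaves at the current depth (sketch only)."""
    return max(leaf_gains, key=leaf_gains.get)

# Candidate loss reductions for the current leaves:
gains = {"leaf_a": 0.8, "leaf_b": 3.1, "leaf_c": 1.2}
print(pick_leaf_to_split(gains))  # leaf_b is grown first
```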

LinearRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='lin_reg', run=True, verbose=1, **kwargs)¶ Trains a Linear Regression.
For more Linear Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv (bool, optional) – If True, run cross-validation on the model, by default None.
 gridsearch (int, Cross-validation Generator, optional) – Cross-validation method, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “lin_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 fit_intercept (boolean, optional, default True) – Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).
 normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
Returns: RegressionModelAnalysis object to view and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.LinearRegression()
>>> model.LinearRegression(model_name='m1', normalize=True)
>>> model.LinearRegression(cv=10)
>>> model.LinearRegression(gridsearch={'normalize':[True, False]}, cv='stratkfold')
>>> model.LinearRegression(run=False) # Add model to the queue

LinearSVR
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='linsvr', run=True, verbose=1, **kwargs)¶ Trains a Linear Support Vector Regression model.
Similar to SVR with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
For more Support Vector info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv (bool, optional) – If True, run cross-validation on the model, by default None.
 gridsearch (int, Cross-validation Generator, optional) – Cross-validation method, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “linsvr”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 epsilon (float, optional (default=0.0)) – Epsilon parameter in the epsilon-insensitive loss function. Note that the value of this parameter depends on the scale of the target variable y. If unsure, set epsilon=0.
 tol (float, optional (default=1e-4)) – Tolerance for stopping criteria.
 C (float, optional (default=1.0)) – Penalty parameter C of the error term.
 loss (string, ‘epsilon_insensitive’ or ‘squared_epsilon_insensitive’ (default=’epsilon_insensitive’)) – Specifies the loss function. ‘epsilon_insensitive’ is the standard SVR loss (L1 loss) while ‘squared_epsilon_insensitive’ is the square of the epsilon-insensitive loss (L2 loss).
 dual (bool, (default=True)) – Select the algorithm to either solve the dual or primal optimization problem. Prefer dual=False when n_samples > n_features.
 fit_intercept (boolean, optional (default=True)) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (i.e. data is expected to be already centered).
 intercept_scaling (float, optional (default=1)) – When self.fit_intercept is True, the instance vector x becomes [x, self.intercept_scaling], i.e. a “synthetic” feature with constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic feature weight. Note: the synthetic feature weight is subject to l1/l2 regularization like all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), intercept_scaling has to be increased.
 max_iter (int, (default=1000)) – The maximum number of iterations to be run.
Returns: RegressionModelAnalysis object to view and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.LinearSVR()
>>> model.LinearSVR(model_name='m1', C=0.0003)
>>> model.LinearSVR(cv=10)
>>> model.LinearSVR(gridsearch={'C':[0.01, 0.02]}, cv='stratkfold')
>>> model.LinearSVR(run=False) # Add model to the queue
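The epsilon parameter above defines an epsilon-insensitive loss: residuals inside the tube cost nothing, and larger residuals are penalized linearly. A minimal sketch of that loss (illustrative only; not liblinear's code):

```python
def epsilon_insensitive(residual, epsilon=0.0):
    """Epsilon-insensitive loss: zero inside the epsilon tube, linear
    outside it (sketch only; default epsilon=0.0 matches the
    parameter's documented default above)."""
    return max(0.0, abs(residual) - epsilon)

# With epsilon=0.1: a residual of 0.05 costs 0.0, a residual of 0.5 costs 0.4
```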

RandomForestRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='rf_reg', run=True, verbose=1, **kwargs)¶ Trains a Random Forest Regression model.
A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).
For more Random Forest info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv (bool, optional) – If True, run cross-validation on the model, by default None.
 gridsearch (int, Cross-validation Generator, optional) – Cross-validation method, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “rf_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 n_estimators (integer, optional (default=10)) – The number of trees in the forest.
 criterion (string, optional (default=”mse”)) – The function to measure the quality of a split. Supported criteria are “mse” for the mean squared error, which is equal to variance reduction as feature selection criterion, and “mae” for the mean absolute error.
 max_depth (integer or None, optional (default=None)) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
 min_samples_split (int, float, optional (default=2)) –
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
 min_samples_leaf (int, float, optional (default=1)) –
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number. If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
 max_features (int, float, string or None, optional (default=”auto”)) –
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split. If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split. If “auto”, then max_features=sqrt(n_features). If “sqrt”, then max_features=sqrt(n_features) (same as “auto”). If “log2”, then max_features=log2(n_features). If None, then max_features=n_features.
Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
 max_leaf_nodes (int or None, optional (default=None)) – Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
 min_impurity_decrease (float, optional (default=0.)) –
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
 bootstrap (boolean, optional (default=True)) – Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
 oob_score (bool (default=False)) – Whether to use out-of-bag samples to estimate the generalization accuracy.
 ccp_alpha (non-negative float, optional (default=0.0)) – Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
Returns: RegressionModelAnalysis object to view and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.RandomForestRegression()
>>> model.RandomForestRegression(model_name='m1', n_estimators=100)
>>> model.RandomForestRegression(cv=10)
>>> model.RandomForestRegression(gridsearch={'n_estimators':[100, 200]}, cv='stratkfold')
>>> model.RandomForestRegression(run=False) # Add model to the queue
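The min_impurity_decrease formula above can be computed directly; a split is accepted only when this quantity is at least min_impurity_decrease. A sketch of that formula (illustrative; not scikit-learn's internal code):

```python
def weighted_impurity_decrease(n, n_t, n_t_l, n_t_r,
                               impurity, left_impurity, right_impurity):
    """N_t / N * (impurity - N_t_R / N_t * right_impurity
                           - N_t_L / N_t * left_impurity),
    exactly the weighted impurity decrease equation above (sketch)."""
    return (n_t / n) * (impurity
                        - (n_t_r / n_t) * right_impurity
                        - (n_t_l / n_t) * left_impurity)

# A node holding 50 of 100 samples with impurity 0.5, split into two
# pure children of 25 samples each: 0.5 * (0.5 - 0 - 0) = 0.25
```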

RidgeRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='ridge_reg', run=True, verbose=1, **kwargs)¶ Trains a Ridge Regression model.
For more Ridge Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv (bool, optional) – If True, run cross-validation on the model, by default None.
 gridsearch (int, Cross-validation Generator, optional) – Cross-validation method, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “ridge_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 alpha ({float, array-like}, shape (n_targets)) – Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization. Alpha corresponds to C^-1 in other linear models such as LogisticRegression or LinearSVC. If an array is passed, penalties are assumed to be specific to the targets. Hence they must correspond in number.
 fit_intercept (boolean) – Whether to calculate the intercept for this model. If set to false, no intercept will be used in calculations (e.g. data is expected to be already centered).
 normalize (boolean, optional, default False) – This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.
 max_iter (int, optional) – Maximum number of iterations for conjugate gradient solver.
 tol (float) – Precision of the solution.
Returns: RegressionModelAnalysis object to view and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.RidgeRegression()
>>> model.RidgeRegression(model_name='m1', alpha=0.0003)
>>> model.RidgeRegression(cv=10)
>>> model.RidgeRegression(gridsearch={'alpha':[0.01, 0.02]}, cv='stratkfold')
>>> model.RidgeRegression(run=False) # Add model to the queue

SGDRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='sgd_reg', run=True, verbose=1, **kwargs)¶ Trains an SGD Regression model.
Linear model fitted by minimizing a regularized empirical loss with SGD.
SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate).
The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.
For more SGD Regression info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv (bool, optional) – If True, run cross-validation on the model, by default None.
 gridsearch (int, Cross-validation Generator, optional) – Cross-validation method, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “sgd_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 loss (str, default: ‘squared_loss’) –
The loss function to be used.
The possible values are ‘squared_loss’, ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’
The ‘squared_loss’ refers to the ordinary least squares fit. ‘huber’ modifies ‘squared_loss’ to focus less on getting outliers correct by switching from squared to linear loss past a distance of epsilon. ‘epsilon_insensitive’ ignores errors less than epsilon and is linear past that; this is the loss function used in SVR. ‘squared_epsilon_insensitive’ is the same but becomes squared loss past a tolerance of epsilon.
 penalty (str, ‘none’, ‘l2’, ‘l1’, or ‘elasticnet’) – The penalty (aka regularization term) to be used. Defaults to ‘l2’ which is the standard regularizer for linear SVM models. ‘l1’ and ‘elasticnet’ might bring sparsity to the model (feature selection) not achievable with ‘l2’.
 alpha (float) – Constant that multiplies the regularization term. Defaults to 0.0001. Also used to compute the learning rate when learning_rate is set to ‘optimal’.
 l1_ratio (float) – The Elastic Net mixing parameter, with 0 <= l1_ratio <= 1. l1_ratio=0 corresponds to L2 penalty, l1_ratio=1 to L1. Defaults to 0.15.
 fit_intercept (bool) – Whether the intercept should be estimated or not. If False, the data is assumed to be already centered. Defaults to True.
 max_iter (int, optional (default=1000)) – The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit.
 tol (float or None, optional (default=1e-3)) – The stopping criterion. If it is not None, the iterations will stop when (loss > best_loss - tol) for n_iter_no_change consecutive epochs.
 shuffle (bool, optional) – Whether or not the training data should be shuffled after each epoch. Defaults to True.
 epsilon (float) –
Epsilon in the epsilon-insensitive loss functions; only if loss is ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’.
For ‘huber’, determines the threshold at which it becomes less important to get the prediction exactly right. For epsilon-insensitive, any differences between the current prediction and the correct label are ignored if they are less than this threshold.
 learning_rate (string, optional) –
The learning rate schedule:
 ‘constant’:
 eta = eta0
 ‘optimal’:
 eta = 1.0 / (alpha * (t + t0)) where t0 is chosen by a heuristic proposed by Leon Bottou.
 ‘invscaling’: [default]
 eta = eta0 / pow(t, power_t)
 ‘adaptive’:
 eta = eta0, as long as the training keeps decreasing. Each time n_iter_no_change consecutive epochs fail to decrease the training loss by tol or fail to increase validation score by tol if early_stopping is True, the current learning rate is divided by 5.
 eta0 (double) – The initial learning rate for the ‘constant’, ‘invscaling’ or ‘adaptive’ schedules. The default value is 0.01.
 power_t (double) – The exponent for inverse scaling learning rate [default 0.5].
 early_stopping (bool, default=False) – Whether to use early stopping to terminate training when validation score is not improving. If set to True, it will automatically set aside a fraction of training data as validation and terminate training when validation score is not improving by at least tol for n_iter_no_change consecutive epochs.
 validation_fraction (float, default=0.1) – The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True.
 n_iter_no_change (int, default=5) – Number of iterations with no improvement to wait before early stopping.
 average (bool or int, optional) – When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples.
Returns: RegressionModelAnalysis object to view and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.SGDRegression()
>>> model.SGDRegression(model_name='m1', alpha=0.0003)
>>> model.SGDRegression(cv=10)
>>> model.SGDRegression(gridsearch={'alpha':[0.01, 0.02]}, cv='stratkfold')
>>> model.SGDRegression(run=False) # Add model to the queue
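The stateless learning-rate schedules listed above (‘constant’, ‘optimal’, ‘invscaling’) are simple formulas in the iteration counter t; ‘adaptive’ is omitted here because it depends on the training history. A sketch of those formulas (t0 is a plain parameter here, whereas the real solver chooses it heuristically):

```python
def learning_rate(schedule, t, eta0=0.01, power_t=0.5, alpha=0.0001, t0=1.0):
    """eta at iteration t for each stateless schedule above (sketch;
    the defaults mirror the documented parameter defaults)."""
    if schedule == "constant":
        return eta0                        # eta = eta0
    if schedule == "optimal":
        return 1.0 / (alpha * (t + t0))    # eta = 1 / (alpha * (t + t0))
    if schedule == "invscaling":
        return eta0 / (t ** power_t)       # eta = eta0 / pow(t, power_t)
    raise ValueError(f"unknown schedule: {schedule!r}")

# 'invscaling' decays with t: eta is 0.01 at t=1 and 0.001 at t=100
```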

SVR
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='svr_reg', run=True, verbose=1, **kwargs)¶ Epsilon-Support Vector Regression.
The free parameters in the model are C and epsilon.
The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using model.LinearSVR or model.SGDRegression instead.
For more Support Vector info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv (bool, optional) – If True, run cross-validation on the model, by default None.
 gridsearch (int, Cross-validation Generator, optional) – Cross-validation method, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “svr_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 kernel (string, optional (default=’rbf’)) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
 degree (int, optional (default=3)) – Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
 gamma (float, optional (default=’auto’)) – Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. Current default is ‘auto’ which uses 1 / n_features, if gamma=’scale’ is passed then it uses 1 / (n_features * X.var()) as value of gamma.
 coef0 (float, optional (default=0.0)) – Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
 tol (float, optional (default=1e-3)) – Tolerance for stopping criterion.
 C (float, optional (default=1.0)) – Penalty parameter C of the error term.
 epsilon (float, optional (default=0.1)) – Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.
 shrinking (boolean, optional (default=True)) – Whether to use the shrinking heuristic.
 cache_size (float, optional) – Specify the size of the kernel cache (in MB).
 max_iter (int, optional (default=-1)) – Hard limit on iterations within solver, or -1 for no limit.
Returns: RegressionModelAnalysis object to view results and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.SVR()
>>> model.SVR(model_name='m1', C=0.0003)
>>> model.SVR(cv=10)
>>> model.SVR(gridsearch={'C':[0.01, 0.02]}, cv='stratkfold')
>>> model.SVR(run=False) # Add model to the queue

XGBoostRegression
(cv_type=None, gridsearch=None, score='neg_mean_squared_error', model_name='xgb_reg', run=True, verbose=1, **kwargs)¶ Trains an XGBoost Regression Model.
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems with billions of examples.
For more XGBoost info, you can view it here: https://xgboost.readthedocs.io/en/latest/ and https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst.
 If running gridsearch, the implemented cross validators are:
 ‘kfold’ for KFold
 ‘stratkfold’ for StratifiedKfold
 Possible scoring metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
Parameters:  cv (int, str, or Cross-Validation Generator, optional) – Cross-validation method to run on the model, e.g. the number of folds or ‘stratkfold’; by default None
 gridsearch (dict, optional) – Dictionary of hyperparameter names to lists of candidate values to grid search over, by default None
 score (str, optional) – Scoring metric to evaluate models, by default ‘neg_mean_squared_error’
 model_name (str, optional) – Name for this model, by default “xgb_reg”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 max_depth (int) – Maximum tree depth for base learners. By default 3
 learning_rate (float) – Boosting learning rate (xgb’s “eta”). By default 0.1
 n_estimators (int) – Number of trees to fit. By default 100.
 objective (string or callable) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below). By default, reg:linear
 booster (string) – Specify which booster to use: gbtree, gblinear or dart. By default ‘gbtree’
 tree_method (string) – Specify which tree method to use If this parameter is set to default, XGBoost will choose the most conservative option available. It’s recommended to study this option from parameters document. By default ‘auto’
 gamma (float) – Minimum loss reduction required to make a further partition on a leaf node of the tree. By default 0
 subsample (float) – Subsample ratio of the training instance. By default 1
 reg_alpha (float (xgb's alpha)) – L1 regularization term on weights. By default 0
 reg_lambda (float (xgb's lambda)) – L2 regularization term on weights. By default 1
 scale_pos_weight (float) – Balancing of positive and negative weights. By default 1
 base_score – The initial prediction score of all instances, global bias. By default 0
 missing (float, optional) – Value in the data which needs to be present as a missing value. If None, defaults to np.nan. By default, None
 num_parallel_tree (int) – Used for boosting random forest. By default 1
 importance_type (string, default "gain") – The feature importance type for the feature_importances_ property: either “gain”, “weight”, “cover”, “total_gain” or “total_cover”. By default ‘gain’.
Note
A custom objective function can be provided for the objective parameter. In this case, it should have the signature objective(y_true, y_pred) -> grad, hess:
 y_true: array_like of shape [n_samples] – The target values
 y_pred: array_like of shape [n_samples] – The predicted values
 grad: array_like of shape [n_samples] – The value of the gradient for each sample point
 hess: array_like of shape [n_samples] – The value of the second derivative for each sample point
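To make the signature concrete, here is a minimal sketch of a custom squared-error objective written with NumPy. The function name and the sample data are illustrative, not part of aethos or XGBoost.

```python
import numpy as np

def squared_error_objective(y_true, y_pred):
    """Hypothetical custom objective for loss = (y_pred - y_true)**2.

    Returns the per-sample gradient and second derivative (hessian)
    of the loss with respect to y_pred, as the signature above requires.
    """
    grad = 2.0 * (y_pred - y_true)    # first derivative of the loss per sample
    hess = np.full_like(y_pred, 2.0)  # second derivative is constant
    return grad, hess
```

It could then be passed as, e.g., model.XGBoostRegression(objective=squared_error_objective).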
Returns: RegressionModelAnalysis object to view results and analyze results
Return type: RegressionModelAnalysis
Examples
>>> model.XGBoostRegression()
>>> model.XGBoostRegression(model_name='m1', reg_alpha=0.0003)
>>> model.XGBoostRegression(cv=10)
>>> model.XGBoostRegression(gridsearch={'reg_alpha':[0.01, 0.02]}, cv='stratkfold')
>>> model.XGBoostRegression(run=False) # Add model to the queue


class
aethos.modelling.unsupervised_models.
Unsupervised
(x_train, exp_name='myexperiment')¶ Bases:
aethos.modelling.model.ModelBase
,aethos.analysis.Analysis
,aethos.cleaning.clean.Clean
,aethos.preprocessing.preprocess.Preprocess
,aethos.feature_engineering.feature.Feature
,aethos.visualizations.visualizations.Visualizations
,aethos.stats.stats.Stats

AgglomerativeClustering
(model_name='agglom', run=True, **kwargs)¶ Trains an Agglomerative Clustering Model
Each data point starts as its own cluster, and pairs of clusters are then successively merged (or agglomerated) until all data points have been merged into a single cluster.
Hierarchical clustering does not require us to specify the number of clusters and we can even select which number of clusters looks best since we are building a tree.
Additionally, the algorithm is not sensitive to the choice of distance metric; all of them tend to work equally well whereas with other clustering algorithms, the choice of distance metric is critical.
A particularly good use case of hierarchical clustering methods is when the underlying data has a hierarchical structure and you want to recover the hierarchy; other clustering algorithms can’t do this.
These advantages of hierarchical clustering come at the cost of lower efficiency, as it has a time complexity of O(n³), unlike the linear complexity of KMeans and GMM.
For a list of all possible options for Agglomerative clustering please visit: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering
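As a rough sketch of what happens under the hood (calling scikit-learn directly; the toy data is made up), ward-linkage agglomerative clustering on two well-separated blobs recovers both groups:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.RandomState(0)
# Two well-separated 2D blobs of 10 points each
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)),
               rng.normal(5.0, 0.1, (10, 2))])

# Successively merge the closest pair of clusters (ward linkage)
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
```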
Parameters:  model_name (str, optional) – Name for this model, by default “agglom”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 n_clusters (int or None, optional (default=2)) – The number of clusters to find. It must be None if distance_threshold is not None.
 affinity (string or callable, default: “euclidean”) – Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method.
 compute_full_tree (bool or ‘auto’ (optional)) – Stop early the construction of the tree at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None.
 linkage ({“ward”, “complete”, “average”, “single”}, optional (default=”ward”)) – Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.
 ‘ward’ minimizes the variance of the clusters being merged.
 ‘average’ uses the average of the distances of each observation of the two sets.
 ‘complete’ or maximum linkage uses the maximum distances between all observations of the two sets.
 ‘single’ uses the minimum of the distances between all observations of the two sets.
 distance_threshold (float, optional (default=None)) – The linkage distance threshold above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.
Returns: UnsupervisedModelAnalysis object to view results and further analysis
Return type: UnsupervisedModelAnalysis
Examples
>>> model.AgglomerativeClustering()
>>> model.AgglomerativeClustering(model_name='ag_1', n_clusters=5)
>>> model.AgglomerativeClustering(run=False) # Add model to the queue

DBScan
(model_name='dbs', run=True, verbose=1, **kwargs)¶ Based on a set of points (let’s think in a bidimensional space), DBSCAN groups together points that are close to each other based on a distance measurement (usually Euclidean distance) and a minimum number of points. It also marks as outliers the points that are in low-density regions.
The DBSCAN algorithm should be used to find associations and structures in data that are hard to find manually but that can be relevant and useful to find patterns and predict trends.
For a list of all possible options for DBSCAN please visit: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
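A minimal scikit-learn sketch of the same idea (toy data assumed): the dense points form one cluster, while the isolated point is labeled -1 as an outlier.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),  # one dense cluster
               [[5.0, 5.0]]])                  # an isolated point

# Points in low-density regions get the special label -1
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```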
Parameters:  model_name (str, optional) – Name for this model, by default “dbs”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 eps (float) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
 min_samples (int, optional) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
 metric (string, or callable) – The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.
 p (float, optional) – The power of the Minkowski metric to be used to calculate distance between points.
 n_jobs (int or None, optional (default=None)) – The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
Returns: UnsupervisedModelAnalysis object to view results and further analysis
Return type: UnsupervisedModelAnalysis
Examples
>>> model.DBScan()
>>> model.DBScan(model_name='dbs_1', min_samples=5)
>>> model.DBScan(run=False) # Add model to the queue

GaussianMixtureClustering
(model_name='gm_cluster', run=True, verbose=1, **kwargs)¶ Trains a GaussianMixture algorithm that implements the expectation-maximization algorithm for fitting mixtures of Gaussian models.
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
There are 2 key advantages to using GMMs.
Firstly, GMMs are a lot more flexible in terms of cluster covariance than K-Means; due to the standard deviation parameter, the clusters can take on any ellipse shape, rather than being restricted to circles.
K-Means is actually a special case of GMM in which each cluster’s covariance along all dimensions approaches 0. Secondly, since GMMs use probabilities, they can have multiple clusters per data point.
So if a data point is in the middle of two overlapping clusters, we can simply define its class by saying it belongs X percent to class 1 and Y percent to class 2, i.e. GMMs support mixed membership.
For more information on Gaussian Mixture algorithms please visit: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture
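A small scikit-learn sketch of the mixed-membership idea (synthetic 1D data assumed): predict_proba returns, for each point, its probability of belonging to each Gaussian component.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two 1D Gaussian modes centered at -3 and +3
X = np.vstack([rng.normal(-3.0, 0.5, (50, 1)),
               rng.normal(3.0, 0.5, (50, 1))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
# Soft membership: per-component probabilities sum to 1 for each point
probs = gm.predict_proba(np.array([[-3.0], [0.0]]))
```

A point at one of the modes is assigned almost entirely to that component, while a point midway between the modes gets split probability.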
Parameters:  model_name (str, optional) – Name for this model, by default “gm_cluster”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 n_components (int, defaults to 1.) – The number of mixture components/ number of unique y_train values.
 covariance_type ({‘full’ (default), ‘tied’, ‘diag’, ‘spherical’}) –
String describing the type of covariance parameters to use. Must be one of:
 ‘full’
 each component has its own general covariance matrix
 ‘tied’
 all components share the same general covariance matrix
 ‘diag’
 each component has its own diagonal covariance matrix
 ‘spherical’
 each component has its own single variance
 tol (float, defaults to 1e-3.) – The convergence threshold. EM iterations will stop when the lower bound average gain is below this threshold.
 reg_covar (float, defaults to 1e-6.) – Non-negative regularization added to the diagonal of covariance. Ensures that the covariance matrices are all positive.
 max_iter (int, defaults to 100.) – The number of EM iterations to perform.
 n_init (int, defaults to 1.) – The number of initializations to perform. The best results are kept.
 init_params ({‘kmeans’, ‘random’}, defaults to ‘kmeans’.) – The method used to initialize the weights, the means and the precisions. Must be one of:
 ‘kmeans’ : responsibilities are initialized using kmeans.
 ‘random’ : responsibilities are initialized randomly.
 weights_init (array-like, shape (n_components, ), optional) – The user-provided initial weights. If None, weights are initialized using the init_params method. Defaults to None.
 means_init (array-like, shape (n_components, n_features), optional) – The user-provided initial means. If None, means are initialized using the init_params method. Defaults to None.
 precisions_init (array-like, optional.) – The user-provided initial precisions (inverse of the covariance matrices), defaults to None. If None, precisions are initialized using the ‘init_params’ method. The shape depends on ‘covariance_type’:
 (n_components,) if ‘spherical’
 (n_features, n_features) if ‘tied’
 (n_components, n_features) if ‘diag’
 (n_components, n_features, n_features) if ‘full’
Returns: UnsupervisedModelAnalysis object to view results and further analysis
Return type: UnsupervisedModelAnalysis
Examples
>>> model.GaussianMixtureClustering()
>>> model.GaussianMixtureClustering(model_name='gm_1', max_iter=1000)
>>> model.GaussianMixtureClustering(run=False) # Add model to the queue

IsolationForest
(model_name='iso_forest', run=True, verbose=1, **kwargs)¶ Isolation Forest Algorithm
Return the anomaly score of each sample using the IsolationForest algorithm
The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
For more Isolation Forest info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest
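A brief scikit-learn sketch (toy data assumed): a point far from the training distribution takes fewer random splits to isolate, so it gets a lower anomaly score and is predicted as -1 (outlier).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(0.0, 0.5, (100, 2))   # "normal" training data

iso = IsolationForest(n_estimators=100, contamination=0.1,
                      random_state=0).fit(X)

# score_samples: lower (more negative) means more anomalous
queries = np.array([[0.0, 0.0], [6.0, 6.0]])
scores = iso.score_samples(queries)
preds = iso.predict(queries)         # +1 inlier, -1 outlier
```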
Parameters:  model_name (str, optional) – Name for this model, by default “iso_forest”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 n_estimators (int, optional (default=100)) – The number of base estimators in the ensemble.
 max_samples (int or float, optional (default=”auto”)) – The number of samples to draw from X to train each base estimator.
 If int, then draw max_samples samples.
 If float, then draw max_samples * X.shape[0] samples.
 If “auto”, then max_samples=min(256, n_samples).
If max_samples is larger than the number of samples provided, all samples will be used for all trees (no sampling).
 contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function. If ‘auto’, the decision function threshold is determined as in the original paper.
 max_features (int or float, optional (default=1.0)) – The number of features to draw from X to train each base estimator.
 If int, then draw max_features features.
 If float, then draw max_features * X.shape[1] features.
 bootstrap (boolean, optional (default=False)) – If True, individual trees are fit on random subsets of the training data sampled with replacement. If False, sampling without replacement is performed.
Returns: UnsupervisedModelAnalysis object to view results and analyze results
Return type: UnsupervisedModelAnalysis
Examples
>>> model.IsolationForest()
>>> model.IsolationForest(model_name='iso_1', max_features=5)
>>> model.IsolationForest(run=False) # Add model to the queue

KMeans
(model_name='km', run=True, verbose=1, **kwargs)¶ NOTE: If ‘n_clusters’ is not provided, k will automatically be determined using an elbow plot, with distortion as the metric to find the optimal number of clusters.
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
In other words, the K-means algorithm identifies k number of centroids, and then allocates every data point to the nearest cluster, while keeping the centroids as small as possible.
For a list of all possible options for K Means clustering please visit: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
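The elbow heuristic mentioned in the note can be sketched with scikit-learn directly (synthetic three-blob data assumed): inertia (distortion) drops sharply until k reaches the true number of clusters, then flattens.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Three well-separated blobs -> the "elbow" should appear at k=3
X = np.vstack([rng.normal(c, 0.2, (30, 2)) for c in (0.0, 5.0, 10.0)])

# Within-cluster sum of squares (inertia) for each candidate k
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 6)}
```

The drop from k=2 to k=3 is large, while the drop from k=3 to k=4 is small, which is exactly the bend an elbow plot visualizes.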
Parameters:  model_name (str, optional) – Name for this model, by default “km”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 n_clusters (int, optional, default: 8) – The number of clusters to form as well as the number of centroids to generate.
 init ({‘kmeans++’, ‘random’ or an ndarray}) – Method for initialization, defaults to ‘kmeans++’:
 ‘kmeans++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
 ‘random’ : choose k observations (rows) at random from data for the initial centroids.
 If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
 n_init (int, default: 10) – Number of time the kmeans algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
 max_iter (int, default: 300) – Maximum number of iterations of the kmeans algorithm for a single run.
 random_state (int, RandomState instance or None (default)) – Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary.
 algorithm (“auto”, “full” or “elkan”, default=”auto”) – Kmeans algorithm to use. The classical EMstyle algorithm is “full”. The “elkan” variation is more efficient by using the triangle inequality, but currently doesn’t support sparse data. “auto” chooses “elkan” for dense data and “full” for sparse data.
Returns: UnsupervisedModelAnalysis object to view results and further analysis
Return type: UnsupervisedModelAnalysis
Examples
>>> model.KMeans()
>>> model.KMeans(model_name='kmean_1', n_clusters=5)
>>> model.KMeans(run=False) # Add model to the queue

MeanShift
(model_name='mshift', run=True, **kwargs)¶ Trains a Mean Shift clustering algorithm.
Mean shift clustering aims to discover “blobs” in a smooth density of samples.
It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region.
These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.
For more info on Mean Shift clustering please visit: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html#sklearn.cluster.MeanShift
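A short scikit-learn sketch (toy two-blob data assumed), estimating the bandwidth from the data as described below:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)),
               rng.normal(4.0, 0.3, (30, 2))])

# Estimate a bandwidth from the data, then shift centroid candidates
# toward the mean of points in their region until they reach density peaks
bw = estimate_bandwidth(X, quantile=0.3)
ms = MeanShift(bandwidth=bw, bin_seeding=True).fit(X)
centers = ms.cluster_centers_  # one centroid per discovered "blob"
```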
Parameters:  model_name (str, optional) – Name for this model, by default “mshift”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 bandwidth (float, optional) –
Bandwidth used in the RBF kernel.
If not given, the bandwidth is estimated using sklearn.cluster.estimate_bandwidth; see the documentation for that function for hints on scalability (see also the Notes, below).
 seeds (array, shape=[n_samples, n_features], optional) – Seeds used to initialize kernels. If not set, the seeds are calculated by clustering.get_bin_seeds with bandwidth as the grid size and default values for other parameters.
 bin_seeding (boolean, optional) – If true, initial kernel locations are not locations of all points, but rather the location of the discretized version of points, where points are binned onto a grid whose coarseness corresponds to the bandwidth. Setting this option to True will speed up the algorithm because fewer seeds will be initialized. default value: False Ignored if seeds argument is not None.
 min_bin_freq (int, optional) – To speed up the algorithm, accept only those bins with at least min_bin_freq points as seeds. If not defined, set to 1.
 cluster_all (boolean, default True) – If true, then all points are clustered, even those orphans that are not within any kernel. Orphans are assigned to the nearest kernel. If false, then orphans are given cluster label -1.
Returns: UnsupervisedModelAnalysis object to view results and further analysis
Return type: UnsupervisedModelAnalysis
Examples
>>> model.MeanShift() >>> model.MeanShift(model_name='ms_1', cluster_all=False) >>> model.MeanShift(run=False) # Add model to the queue

OneClassSVM
(model_name='ocsvm', run=True, verbose=1, **kwargs)¶ Trains a One Class SVM model.
Unsupervised Outlier Detection.
For more Support Vector info, you can view it here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM
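A compact scikit-learn sketch (synthetic data assumed): the model learns the support of the training distribution, then predicts +1 for inliers and -1 for outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0.0, 0.5, (200, 2))  # "normal" observations only

# nu upper-bounds the fraction of training errors (here ~5%)
oc = OneClassSVM(kernel="rbf", gamma="auto", nu=0.05).fit(X_train)

preds = oc.predict(np.array([[0.0, 0.0], [5.0, 5.0]]))
```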
Parameters:  model_name (str, optional) – Name for this model, by default “ocsvm”
 run (bool, optional) – Whether to train the model or just initialize it with parameters (useful when wanting to test multiple models at once), by default True
 verbose (int, optional) – Verbosity level of model output; the higher the number, the more verbose. By default, 1
 kernel (string, optional (default=’rbf’)) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix.
 degree (int, optional (default=3)) – Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
 gamma (float, optional (default=’auto’)) – Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. Current default is ‘auto’ which uses 1 / n_features, if gamma=’scale’ is passed then it uses 1 / (n_features * X.var()) as value of gamma.
 coef0 (float, optional (default=0.0)) – Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
 tol (float, optional) – Tolerance for stopping criterion.
 nu (float, optional) – An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken.
 shrinking (boolean, optional) – Whether to use the shrinking heuristic.
 cache_size (float, optional) – Specify the size of the kernel cache (in MB).
 max_iter (int, optional (default=-1)) – Hard limit on iterations within solver, or -1 for no limit.
Returns: UnsupervisedModelAnalysis object to view results and analyze results
Return type: UnsupervisedModelAnalysis
Examples
>>> model.OneClassSVM()
>>> model.OneClassSVM(model_name='ocs_1', max_iter=100)
>>> model.OneClassSVM(run=False) # Add model to the queue

Model Analysis API¶

class
aethos.model_analysis.model_analysis.
ModelAnalysisBase
¶ Bases:
aethos.visualizations.visualizations.Visualizations
,aethos.stats.stats.Stats

test_results
¶

to_pickle
()¶ Writes model to a pickle file.
Examples
>>> m = Model(df)
>>> m_results = m.LogisticRegression()
>>> m_results.to_pickle()

to_service
(project_name: str)¶ Creates an app.py, requirements.txt and Dockerfile in ~/.aethos/projects and the necessary folder structure to run the model as a microservice.
Parameters: project_name (str) – Name of the project that you want to create. Examples
>>> m = Model(df)
>>> m_results = m.LogisticRegression()
>>> m_results.to_service('your_proj_name')

train_results
¶


class
aethos.model_analysis.model_analysis.
SupervisedModelAnalysis
(model, x_train, x_test, y_train, y_test, model_name)¶ Bases:
aethos.model_analysis.model_analysis.ModelAnalysisBase

decision_plot
(num_samples=0.6, sample_no=None, highlight_misclassified=False, output_file='', **decisionplot_kwargs)¶ Visualize model decisions using cumulative SHAP values.
Each colored line in the plot represents the model prediction for a single observation.
Note that plotting too many samples at once can make the plot unintelligible.
 When is a decision plot useful:
 Show a large number of feature effects clearly.
 Visualize multioutput predictions.
 Display the cumulative effect of interactions.
 Explore feature effects for a range of feature values.
 Identify outliers.
 Identify typical prediction paths.
 Compare and contrast predictions for several models.
 Explanation:
 The plot is centered on the x-axis at the model’s expected value.
 All SHAP values are relative to the model’s expected value like a linear model’s effects are relative to the intercept.
 The y-axis lists the model’s features. By default, the features are ordered by descending importance.
 The importance is calculated over the observations plotted. This is usually different than the importance ordering for the entire dataset. In addition to feature importance ordering, the decision plot also supports hierarchical cluster feature ordering and user-defined feature ordering.
 Each observation’s prediction is represented by a colored line.
 At the top of the plot, each line strikes the x-axis at its corresponding observation’s predicted value. This value determines the color of the line on a spectrum.
 Moving from the bottom of the plot to the top, SHAP values for each feature are added to the model’s base value. This shows how each feature contributes to the overall prediction.
 At the bottom of the plot, the observations converge at the model’s expected value.
Parameters:  output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.
 num_samples (int, float, or 'all', optional) – Number of samples to display; if less than 1 it will be treated as a percentage, ‘all’ will include all samples, by default 0.6
 sample_no (int, optional) – Sample number to isolate and analyze, if provided it overrides num_samples, by default None
 highlight_misclassified (bool, optional) – True to highlight the misclassified results, by default False
 feature_order (str or None or list or numpy.ndarray) – Any of “importance” (the default), “hclust” (hierarchical clustering), “none”, or a list/array of indices. hclust is useful for finding outliers.
 feature_display_range (slice or range) – The slice or range of features to plot after ordering features by feature_order. A step of 1 or None will display the features in ascending order. A step of -1 will display the features in descending order. If feature_display_range=None, slice(-1, -21, -1) is used (i.e. show the last 20 features in descending order). If shap_values contains interaction values, the number of features is automatically expanded to include all possible interactions: N(N + 1)/2 where N = shap_values.shape[1].
 highlight (Any) – Specify which observations to draw in a different line style. All numpy indexing methods are supported. For example, list of integer indices, or a bool array.
 link (str) – Use “identity” or “logit” to specify the transformation used for the x-axis. The “logit” link transforms log-odds into probabilities.
 plot_color (str or matplotlib.colors.ColorMap) – Color spectrum used to draw the plot lines. If str, a registered matplotlib color name is assumed.
 axis_color (str or int) – Color used to draw plot axes.
 y_demarc_color (str or int) – Color used to draw feature demarcation lines on the y-axis.
 alpha (float) – Alpha blending value in [0, 1] used to draw plot lines.
 color_bar (bool) – Whether to draw the color bar.
 auto_size_plot (bool) – Whether to automatically size the matplotlib plot to fit the number of features displayed. If False, specify the plot size using matplotlib before calling this function.
 title (str) – Title of the plot.
 xlim (tuple[float, float]) – The extents of the x-axis (e.g. (-1.0, 1.0)). If not specified, the limits are determined by the maximum/minimum predictions centered around base_value when link=’identity’. When link=’logit’, the x-axis extents are (0, 1) centered at 0.5. x_lim values are not transformed by the link function. This argument is provided to simplify producing multiple plots on the same scale for comparison.
 show (bool) – Whether to automatically display the plot.
 return_objects (bool) – Whether to return a DecisionPlotResult object containing various plotting features. This can be used to generate multiple decision plots using the same feature ordering and scale, by default True.
 ignore_warnings (bool) – Plotting many data points or too many features at a time may be slow, or may create very large plots. Set this argument to True to override hardcoded limits that prevent plotting large amounts of data.
 new_base_value (float) – SHAP values are relative to a base value; by default, the expected value of the model’s raw predictions. Use new_base_value to shift the base value to an arbitrary value (e.g. the cutoff point for a binary classification task).
 legend_labels (list of str) – List of legend labels. If None, legend will not be shown.
 legend_location (str) – Legend location. Any of “best”, “upper right”, “upper left”, “lower left”, “lower right”, “right”, “center left”, “center right”, “lower center”, “upper center”, “center”.
Returns: If return_objects=True (the default). Returns None otherwise.
Return type: DecisionPlotResult
Examples
>>> # Plot two decision plots using the same feature order and x-axis.
>>> m = model.LogisticRegression()
>>> r = m.decision_plot()
>>> m.decision_plot(sample_no=42, feature_order=r.feature_idx, xlim=r.xlim)

dependence_plot
(feature: str, interaction='auto', output_file='', **dependenceplot_kwargs)¶ A dependence plot is a scatter plot that shows the effect a single feature has on the predictions made by the model.
 Explanation:
 Each dot is a single prediction (row) from the dataset.
 The x-axis is the value of the feature (from the X matrix).
 The y-axis is the SHAP value for that feature, which represents how much knowing that feature’s value changes the output of the model for that sample’s prediction.
 The color corresponds to a second feature that may have an interaction effect with the feature we are plotting (by default this second feature is chosen automatically).
 If an interaction effect is present between this other feature and the feature we are plotting it will show up as a distinct vertical pattern of coloring.
Parameters:  feature (str) – Feature whose impact on the model you want to analyze
 interaction ("auto", None, int, or string) – The index of the feature used to color the plot. The name of a feature can also be passed as a string. If “auto” then shap.common.approximate_interactions is used to pick what seems to be the strongest interaction (note that to find the true strongest interaction you need to compute the SHAP interaction values).
 output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.
 x_jitter (float (0 - 1)) – Adds random jitter to feature values. May increase plot readability when the feature is discrete.
 alpha (float) – The transparency of the data points (between 0 and 1). This can be useful to show the density of the data points when using a large dataset.
 xmin (float or string) – Represents the lower bound of the plot’s x-axis. It can be a string of the format “percentile(float)” to denote that percentile of the feature’s values used on the x-axis.
 xmax (float or string) – Represents the upper bound of the plot’s x-axis. It can be a string of the format “percentile(float)” to denote that percentile of the feature’s values used on the x-axis.
 ax (matplotlib Axes object) – Optionally specify an existing matplotlib Axes object, into which the plot will be placed. In this case we do not create a Figure, otherwise we do.
 cmap (str or matplotlib.colors.ColorMap) – Color spectrum used to draw the plot lines. If str, a registered matplotlib color name is assumed.
Examples
>>> m = model.LogisticRegression()
>>> m.dependence_plot()

force_plot
(sample_no=None, misclassified=False, output_file='', **forceplot_kwargs)¶ Visualize the given SHAP values with an additive force layout
Parameters:  sample_no (int, optional) – Sample number to isolate and analyze, by default None
 misclassified (bool, optional) – True to only show the misclassified results, by default False
 output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.
 link ("identity" or "logit") – The transformation used when drawing the tick mark labels. Using logit will change logodds numbers into probabilities.
 matplotlib (bool) – Whether to use the default Javascript output, or the (less developed) matplotlib output. Using matplotlib can be helpful in scenarios where rendering Javascript/HTML is inconvenient.
Examples
>>> m = model.LogisticRegression()
>>> m.force_plot()  # The entire test dataset
>>> m.force_plot(sample_no=1, misclassified=True)  # Analyze the first misclassified result

interpret_model
(show=True)¶ Displays a dashboard interpreting your model’s performance, behaviour and individual predictions.
If you have run any other interpret functions, they will be included in the dashboard; otherwise, all the other interpretable methods will be included in the dashboard.
Examples
>>> m = model.LogisticRegression()
>>> m.interpret_model()

interpret_model_behavior
(method='all', predictions='default', show=True, **interpret_kwargs)¶ Provides an interpretable summary of your model’s behaviour based on an explainer.
Can either be ‘morris’ or ‘dependence’ for Partial Dependence.
If ‘all’, a dashboard is displayed with both Morris and dependence analyses.
Parameters:  method (str, optional) – Explainer type, can either be ‘all’, ‘morris’ or ‘dependence’, by default ‘all’
 predictions (str, optional) – Prediction type, can either be ‘default’ (.predict) or ‘probability’ if the model can predict probabilities, by default ‘default’
 show (bool, optional) – False to not display the plot, by default True
Examples
>>> m = model.LogisticRegression()
>>> m.interpret_model_behavior()

interpret_model_performance
(method='all', predictions='default', show=True, **interpret_kwargs)¶ Plots an interpretable display of your model based on a performance metric.
Can either be ‘ROC’ or ‘PR’ (precision-recall) for classification problems.
Can be ‘regperf’ for regression problems.
If ‘all’, a dashboard is displayed with the corresponding explainers for the problem type.
ROC: Receiver Operating Characteristic; PR: Precision-Recall; regperf: RegressionPerf
Parameters:  method (str) – Performance metric, either ‘all’, ‘roc’ or ‘PR’, by default ‘all’
 predictions (str, optional) – Prediction type, can either be ‘default’ (.predict) or ‘probability’ if the model can predict probabilities, by default ‘default’
 show (bool, optional) – False to not display the plot, by default True
Examples
>>> m = model.LogisticRegression()
>>> m.interpret_model_performance()

interpret_model_predictions
(num_samples=0.25, sample_no=None, method='all', predictions='default', show=True, **interpret_kwargs)¶ Plots an interpretable display that explains individual predictions of your model.
Supported explainers are either ‘lime’ or ‘shap’.
If ‘all’, a dashboard is displayed with both LIME and SHAP analyses.
Parameters:  num_samples (int, float, or 'all', optional) – Number of samples to display; if less than 1 it will be treated as a percentage, ‘all’ will include all samples, by default 0.25
 sample_no (int, optional) – Sample number to isolate and analyze, if provided it overrides num_samples, by default None
 method (str, optional) – Explainer type, can either be ‘all’, ‘lime’, or ‘shap’, by default ‘all’
 predictions (str, optional) – Prediction type, can either be ‘default’ (.predict) or ‘probability’ if the model can predict probabilities, by default ‘default’
 show (bool, optional) – False to not display the plot, by default True
Examples
>>> m = model.LogisticRegression()
>>> m.interpret_model_predictions()

model_weights
()¶ Prints and logs all the features ranked by importance from most to least important.
Returns: Dictionary of features and their corresponding weights
Return type: dict
Raises: AttributeError – If model does not have coefficients to display
Examples
>>> m = model.LogisticRegression()
>>> m.model_weights()

shap_get_misclassified_index
()¶ Prints the sample numbers of misclassified samples.
Examples
>>> m = model.LogisticRegression()
>>> m.shap_get_misclassified_index()

summary_plot
(output_file='', **summaryplot_kwargs)¶ Create a SHAP summary plot, colored by feature values when they are provided.
For a list of all kwargs please see the Shap documentation : https://shap.readthedocs.io/en/latest/#plots
Parameters:  output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.
 max_display (int) – How many top features to include in the plot (default is 20, or 7 for interaction plots), by default None
 plot_type ("dot" (default for single output), "bar" (default for multioutput), "violin", or "compact_dot") – What type of summary plot to produce. Note that “compact_dot” is only used for SHAP interaction values.
 color (str or matplotlib.colors.ColorMap) – Color spectrum used to draw the plot lines. If str, a registered matplotlib color name is assumed.
 axis_color (str or int) – Color used to draw plot axes.
 title (str) – Title of the plot.
 alpha (float) – Alpha blending value in [0, 1] used to draw plot lines.
 show (bool) – Whether to automatically display the plot.
 sort (bool) – Whether to sort features by importance, by default True
 color_bar (bool) – Whether to draw the color bar.
 auto_size_plot (bool) – Whether to automatically size the matplotlib plot to fit the number of features displayed. If False, specify the plot size using matplotlib before calling this function.
 layered_violin_max_num_bins (int) – Max number of bins, by default 20
 **summaryplot_kwargs – For more info see https://shap.readthedocs.io/en/latest/#plots
Examples
>>> m = model.LogisticRegression()
>>> m.summary_plot()

view_tree
(tree_num=0, output_file=None, **kwargs)¶ Plot decision trees.
Parameters:  tree_num (int, optional) – For ensemble, boosting, and stacking methods  the tree number to plot, by default 0
 output_file (str, optional) – Name of the file including extension, by default None
Examples
>>> m = model.DecisionTreeClassifier()
>>> m.view_tree()
>>> m = model.XGBoostClassifier()
>>> m.view_tree(2)


class
aethos.model_analysis.classification_model_analysis.
ClassificationModelAnalysis
(model, x_train, x_test, target, model_name)¶ Bases:
aethos.model_analysis.model_analysis.SupervisedModelAnalysis

accuracy
(**kwargs)¶ It measures how many observations, both positive and negative, were correctly classified.
Returns: Accuracy
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.accuracy()

average_precision
(**kwargs)¶ AP summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.
Returns: Average Precision Score
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.average_precision()

balanced_accuracy
(**kwargs)¶ The balanced accuracy is used in binary and multiclass classification problems to deal with imbalanced datasets. It is defined as the average of recall obtained on each class.
The best value is 1 and the worst value is 0 when adjusted=False.
Returns: Balanced accuracy
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.balanced_accuracy()
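The "average of recall obtained on each class" can be checked with a plain-Python sketch, independent of aethos (the label arrays below are made-up illustrative data):

```python
# Balanced accuracy = mean of per-class recall.
y_true = [0, 0, 0, 0, 1, 1]   # imbalanced: four 0s, two 1s (illustrative data)
y_pred = [0, 0, 0, 0, 1, 0]   # misses one of the two positives

recalls = []
for c in sorted(set(y_true)):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
    support = sum(1 for t in y_true if t == c)
    recalls.append(tp / support)

balanced_accuracy = sum(recalls) / len(recalls)   # (4/4 + 1/2) / 2 = 0.75
# Plain accuracy is 5/6, inflated by the easy majority class:
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Note how the balanced score penalizes the missed minority-class positive more heavily than plain accuracy does.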

brier_loss
(**kwargs)¶ Compute the Brier score. The smaller the Brier score, the better, hence the naming with “loss”. Across all items in a set of N predictions, the Brier score measures the mean squared difference between (1) the predicted probability assigned to the possible outcomes for item i, and (2) the actual outcome. Therefore, the lower the Brier score is for a set of predictions, the better the predictions are calibrated.
The Brier score is appropriate for binary and categorical outcomes that can be structured as true or false, but is inappropriate for ordinal variables which can take on three or more values (this is because the Brier score assumes that all possible outcomes are equivalently “distant” from one another).
Returns: Brier loss
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.brier_loss()
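The "mean squared difference" described above can be sketched in a few lines of plain Python (the probabilities below are hypothetical, not aethos output):

```python
# Brier score: mean squared gap between predicted probability and the 0/1 outcome.
y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.2, 0.6, 0.4]   # predicted probability of the positive class

brier = sum((p - t) ** 2 for p, t in zip(y_prob, y_true)) / len(y_true)
# (0.01 + 0.04 + 0.16 + 0.36) / 4 = 0.1425
```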

classification_report
()¶ Prints and logs the classification report.
The classification report displays and logs the information in this format:
              precision  recall  f1-score  support
           1       1.00    0.67      0.80        3
           2       0.00    0.00      0.00        0
           3       0.00    0.00      0.00        0
   micro avg       1.00    0.67      0.80        3
   macro avg       0.33    0.22      0.27        3
weighted avg       1.00    0.67      0.80        3
Examples
>>> m = model.LogisticRegression()
>>> m.classification_report()

cohen_kappa
(**kwargs)¶ Cohen Kappa tells you how much better your model is than a random classifier that predicts based on class frequencies.
This measure is intended to compare labelings by different human annotators, not a classifier versus a ground truth.
The kappa score is a number between -1 and 1. Scores above .8 are generally considered good agreement; zero or lower means no agreement (practically random labels).
Returns: Cohen Kappa score.
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.cohen_kappa()
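The standard kappa formula is (observed agreement - chance agreement) / (1 - chance agreement); a minimal plain-Python sketch with made-up labels:

```python
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]
n = len(y_true)

# Observed agreement: fraction of matching labels (5/6 here).
p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
# Chance agreement: sum over classes of (true frequency * predicted frequency).
p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n)
          for c in set(y_true) | set(y_pred))
kappa = (p_o - p_e) / (1 - p_e)   # (5/6 - 1/2) / (1/2) = 2/3
```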

confusion_matrix
(title=None, normalize=False, hide_counts=False, x_tick_rotation=0, figsize=None, cmap='Blues', title_fontsize='large', text_fontsize='medium', output_file='')¶ Prints a confusion matrix as a heatmap.
Parameters:  title (str) – The text to display at the top of the matrix, by default ‘Confusion Matrix’
 normalize (bool) – If False, plot the raw numbers; if True, plot the proportions, by default False
 hide_counts (bool) – If False, display the counts and percentages; if True, hide them, by default False
 x_tick_rotation (int) – Degree of rotation for the x ticks, by default 0
 figsize (tuple(int, int)) – Size of the figure, by default None
 cmap (str) – The gradient of the values displayed from matplotlib.pyplot.cm, see http://matplotlib.org/examples/color/colormaps_reference.html, e.g. plt.get_cmap(‘jet’) or plt.cm.Blues, by default ‘Blues’
 title_fontsize (str) – Size of the title, by default ‘large’
 text_fontsize (str) – Size of the text of the rest of the plot, by default ‘medium’
 output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.
Examples
>>> m = model.LogisticRegression()
>>> m.confusion_matrix()
>>> m.confusion_matrix(normalize=True)

cross_validate
(cv_type='stratkfold', score='accuracy', n_splits=5, shuffle=False, **kwargs)¶ Runs cross validation on a Classification model.
 Scoring Metrics:
 ‘accuracy’
 ‘balanced_accuracy’
 ‘average_precision’
 ‘brier_score_loss’
 ‘f1’
 ‘f1_micro’
 ‘f1_macro’
 ‘f1_weighted’
 ‘f1_samples’
 ‘neg_log_loss’
 ‘precision’
 ‘recall’
 ‘jaccard’
 ‘roc_auc’
 ‘roc_auc_ovr’
 ‘roc_auc_ovo’
 ‘roc_auc_ovr_weighted’
 ‘roc_auc_ovo_weighted’
Parameters:  cv_type ({kfold, stratkfold}, optional) – Cross-validation type, by default “stratkfold”
 score (str, optional) – Scoring metric, by default “accuracy”
 n_splits (int, optional) – Number of times to split the data, by default 5
 shuffle (bool, optional) – True to shuffle the data, by default False

decision_boundary
(x=None, y=None, title='Decision Boundary')¶ Plots a decision boundary for a given model.
If no x or y columns are provided, it defaults to the first 2 columns of your data.
Parameters:  x (str, optional) – Column in the dataframe to plot, Feature one, by default None
 y (str, optional) – Column in the dataframe to plot, Feature two, by default None
 title (str, optional) – Title of the decision boundary plot, by default “Decision Boundary”

f1
(**kwargs)¶ The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)
In the multiclass and multilabel case, this is the average of the F1 score of each class with weighting depending on the average parameter.
Returns: F1 Score
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.f1()
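The formula F1 = 2 * (precision * recall) / (precision + recall) can be verified directly in plain Python (the tp/fp/fn counts are illustrative):

```python
tp, fp, fn = 6, 2, 4                                      # hypothetical counts
precision = tp / (tp + fp)                                # 0.75
recall = tp / (tp + fn)                                   # 0.6
f1 = 2 * (precision * recall) / (precision + recall)      # 2 * 0.45 / 1.35 = 2/3
```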

fbeta
(beta=0.5, **kwargs)¶ The Fbeta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. The beta parameter determines the weight of recall in the combined score. Beta < 1 lends more weight to precision, while beta > 1 favors recall (beta -> 0 considers only precision, beta -> inf only recall).
Parameters: beta (float, optional) – Weight of precision in harmonic mean, by default 0.5
Returns: Fbeta score
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.fbeta()
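The effect of beta can be seen with the standard formula F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); a plain-Python sketch with hypothetical precision/recall values:

```python
def fbeta(precision, recall, beta):
    # Weighted harmonic mean of precision and recall.
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.9, 0.3                  # high precision, low recall (illustrative)
f_half = fbeta(p, r, beta=0.5)   # beta < 1 leans toward precision
f_two = fbeta(p, r, beta=2.0)    # beta > 1 leans toward recall
# f_half > f_two here, because the classifier's strength is precision.
```

With beta = 1 the formula reduces to the ordinary F1 score.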

hamming_loss
(**kwargs)¶ The Hamming loss is the fraction of labels that are incorrectly predicted.
Returns: Hamming loss
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.hamming_loss()
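For a multi-label case, "fraction of labels that are incorrectly predicted" can be sketched in plain Python (the label matrices are made up for illustration):

```python
y_true = [[1, 0, 1], [0, 1, 0]]   # two samples, three labels each
y_pred = [[1, 1, 1], [0, 0, 0]]

# Count label-level mismatches across every sample.
wrong = sum(t != p
            for row_t, row_p in zip(y_true, y_pred)
            for t, p in zip(row_t, row_p))
total = sum(len(row) for row in y_true)
hamming = wrong / total           # 2 wrong labels out of 6 -> 1/3
```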

hinge_loss
(**kwargs)¶ Computes the average distance between the model and the data using hinge loss, a one-sided metric that considers only prediction errors.
Returns: Hinge loss
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.hinge_loss()

jaccard
(**kwargs)¶ The Jaccard index, or Jaccard similarity coefficient, defined as the size of the intersection divided by the size of the union of two label sets, is used to compare the set of predicted labels for a sample to the corresponding set of labels in y_true.
Returns: Jaccard Score
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.jaccard()
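The intersection-over-union definition maps directly onto Python set operations; a tiny sketch with made-up label sets:

```python
true_labels = {"cat", "dog"}
pred_labels = {"dog", "bird"}

# |intersection| / |union| = |{dog}| / |{bird, cat, dog}| = 1/3
jaccard = len(true_labels & pred_labels) / len(true_labels | pred_labels)
```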

log_loss
(**kwargs)¶ Log loss, aka logistic loss or cross-entropy loss.
This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative loglikelihood of the true labels given a probabilistic classifier’s predictions.
Returns: Log loss
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.log_loss()
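The negative log-likelihood described above, for binary labels, is -mean(y*log(p) + (1-y)*log(1-p)); a plain-Python sketch with hypothetical probabilities:

```python
import math

y_true = [1, 0, 1]
y_prob = [0.9, 0.1, 0.8]   # predicted probability of the positive class

log_loss = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_prob)) / len(y_true)
# -(ln 0.9 + ln 0.9 + ln 0.8) / 3, roughly 0.1446
```

Confident, correct predictions (probabilities near the true label) drive the loss toward 0; confident wrong ones blow it up.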

matthews_corr_coef
(**kwargs)¶ The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary and multiclass classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient.
Returns: Matthews Correlation Coefficient
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.matthews_corr_coef()
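The standard MCC formula uses all four confusion-matrix cells; a plain-Python sketch with hypothetical counts:

```python
import math

def mcc(tp, tn, fp, fn):
    # (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom

imperfect = mcc(tp=6, tn=3, fp=1, fn=2)   # somewhere between 0 and 1
perfect = mcc(tp=4, tn=4, fp=0, fn=0)     # exactly +1.0
```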

metrics
(*metrics)¶ Measures how well your model performed against certain metrics.
For multi-class classification problems, the ‘macro’ average is used.
If project metrics have been specified, it will display those metrics; otherwise it will display the specified metrics or all metrics.
For more detailed information and parameters please see the following link: https://scikit-learn.org/stable/modules/classes.html#classification-metrics
Supported metrics are:
‘Accuracy’: ‘Measures how many observations, both positive and negative, were correctly classified.’,
‘Balanced Accuracy’: ‘The balanced accuracy in binary and multiclass classification problems to deal with imbalanced datasets. It is defined as the average of recall obtained on each class.’,
‘Average Precision’: ‘Summarizes a precisionrecall curve as the weighted mean of precisions achieved at each threshold’,
‘ROC AUC’: ‘Shows how good at ranking predictions your model is. It tells you what is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.’,
‘Zero One Loss’: ‘Fraction of misclassifications.’,
‘Precision’: ‘It measures how many observations predicted as positive are positive. Good to use when False Positives are costly.’,
‘Recall’: ‘It measures how many observations out of all positive observations we have classified as positive. Good to use when catching all positive occurrences, usually at the cost of false positives.’,
‘Matthews Correlation Coefficient’: ‘It’s a correlation between predicted classes and ground truth.’,
‘Log Loss’: ‘Difference between ground truth and predicted score for every observation and average those errors over all observations.’,
‘Jaccard’: ‘Defined as the size of the intersection divided by the size of the union of two label sets, is used to compare set of predicted labels for a sample to the corresponding set of true labels.’,
‘Hinge Loss’: ‘Computes the average distance between the model and the data using hinge loss, a onesided metric that considers only prediction errors.’,
‘Hamming Loss’: ‘The Hamming loss is the fraction of labels that are incorrectly predicted.’,
‘FBeta’: ‘It’s the harmonic mean between precision and recall, with an emphasis on one or the other. Takes into account both metrics, good for imbalanced problems (spam, fraud, etc.).’,
‘F1’: ‘It’s the harmonic mean between precision and recall. Takes into account both metrics, good for imbalanced problems (spam, fraud, etc.).’,
‘Cohen Kappa’: ‘Cohen Kappa tells you how much better is your model over the random classifier that predicts based on class frequencies. Works well for imbalanced problems.’,
‘Brier Loss’: ‘It is a measure of how far your predictions lie from the true values. Basically, it is a mean square error in the probability space.’
Parameters: metrics (str(s), optional) – Specific type of metrics to view
Examples
>>> m = model.LogisticRegression()
>>> m.metrics()
>>> m.metrics('F1', 'FBeta')

precision
(**kwargs)¶ The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives.
The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The best value is 1 and the worst value is 0.
Returns: Precision
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.precision()

recall
(**kwargs)¶ The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives.
The recall is intuitively the ability of the classifier to find all the positive samples.
The best value is 1 and the worst value is 0.
Returns: Recall
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.recall()
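Both the tp / (tp + fp) and tp / (tp + fn) ratios above can be read straight off a 2x2 confusion matrix; a plain-Python check with hypothetical counts:

```python
tp, fp = 8, 2   # 8 correct positive predictions, 2 false alarms
fn, tn = 4, 6   # 4 positives missed, 6 negatives correctly rejected

precision = tp / (tp + fp)   # 8/10 = 0.8: of the predicted positives, how many are real
recall = tp / (tp + fn)      # 8/12 = 2/3: of the real positives, how many were found
```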

roc_auc
(**kwargs)¶ This metric shows how good at ranking predictions your model is. It tells you the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
Returns: ROC AUC Score
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.roc_auc()

roc_curve
(title=True, output_file='')¶ Plots an ROC curve and displays the ROC statistics (area under the curve).
Parameters:  figsize (tuple(int, int), optional) – Figure size, by default (600,450)
 title (bool) – Whether to display title, by default True
 output_file (str, optional) – If a name is provided save the plot to an html file, by default ‘’
Examples
>>> m = model.LogisticRegression()
>>> m.roc_curve()

zero_one_loss
(**kwargs)¶ Returns the fraction of misclassifications (float) by default; if normalize is set to False, it returns the number of misclassifications (int).
The best performance is 0.
Returns: Zero one loss
Return type: float
Examples
>>> m = model.LogisticRegression()
>>> m.zero_one_loss()
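The count/fraction distinction is easy to see in a plain-Python sketch (labels are illustrative):

```python
y_true = [1, 0, 1, 1]
y_pred = [1, 1, 1, 0]

n_wrong = sum(t != p for t, p in zip(y_true, y_pred))   # 2 misclassifications (the int form)
fraction = n_wrong / len(y_true)                        # 0.5 (the normalized float form)
```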


class
aethos.model_analysis.regression_model_analysis.
RegressionModelAnalysis
(model, x_train, x_test, target, model_name)¶ Bases:
aethos.model_analysis.model_analysis.SupervisedModelAnalysis

cross_validate
(cv_type='kfold', score='neg_root_mean_squared_error', n_splits=5, shuffle=False, **kwargs)¶ Runs cross validation on a Regression model.
 Scoring Metrics:
 ‘explained_variance’
 ‘max_error’
 ‘neg_mean_absolute_error’ –> MAE
 ‘neg_mean_squared_error’ –> MSE
 ‘neg_mean_squared_log_error’ –> MSLE
 ‘neg_median_absolute_error’ –> MeAE
 ‘r2’
 ‘neg_mean_poisson_deviance’
 ‘neg_mean_gamma_deviance’
Parameters:  cv_type ({kfold, stratkfold}, optional) – Cross-validation type, by default “kfold”
 score (str, optional) – Scoring metric, by default “neg_root_mean_squared_error”
 n_splits (int, optional) – Number of times to split the data, by default 5
 shuffle (bool, optional) – True to shuffle the data, by default False

explained_variance
(multioutput='uniform_average', **kwargs)¶ Explained variance regression score function.
Best possible score is 1.0, lower values are worse.
Parameters: multioutput (string in [‘raw_values’, ‘uniform_average’, ‘variance_weighted’] or array-like of shape (n_outputs)) – Defines aggregating of multiple output scores. Array-like value defines weights used to average scores.
 ‘raw_values’ :
 Returns a full set of scores in case of multioutput input.
 ‘uniform_average’ :
 Scores of all outputs are averaged with uniform weight.
 ‘variance_weighted’ :
 Scores of all outputs are averaged, weighted by the variances of each individual output.
By default ‘uniform_average’
Returns: Explained Variance
Return type: float
Examples
>>> m = model.LinearRegression()
>>> m.explained_variance()

max_error
()¶ Returns the single most maximum residual error.
Returns: Max error
Return type: float
Examples
>>> m = model.LinearRegression()
>>> m.max_error()

mean_abs_error
(**kwargs)¶ Mean absolute error.
Returns: Mean absolute error.
Return type: float
Examples
>>> m = model.LinearRegression()
>>> m.mean_abs_error()

mean_sq_error
(**kwargs)¶ Mean squared error.
Returns: Mean squared error.
Return type: float
Examples
>>> m = model.LinearRegression()
>>> m.mean_sq_error()

mean_sq_log_error
(**kwargs)¶ Mean squared log error.
Returns: Mean squared log error.
Return type: float
Examples
>>> m = model.LinearRegression()
>>> m.mean_sq_log_error()

median_abs_error
(**kwargs)¶ Median absolute error.
Returns: Median absolute error.
Return type: float
Examples
>>> m = model.LinearRegression()
>>> m.median_abs_error()

metrics
(*metrics)¶ Measures how well your model performed against certain metrics.
If project metrics have been specified, it will display those metrics; otherwise it will display the specified metrics or all metrics.
For more detailed information and parameters please see the following link: https://scikit-learn.org/stable/modules/classes.html#regression-metrics
 Supported metrics are:
‘Explained Variance’: ‘Explained variance regression score function. Best possible score is 1.0, lower values are worse.’,
‘Max Error’: ‘Returns the single most maximum residual error.’,
‘Mean Absolute Error’: ‘Positive mean value of all residuals’,
‘Mean Squared Error’: ‘Mean of the squared residuals’,
‘Root Mean Squared Error’: ‘Square root of the Mean Squared Error’,
‘Mean Squared Log Error’: ‘Mean of the squared differences between the logs of the actual and predicted values’,
‘Median Absolute Error’: ‘Positive median value of all residuals’,
‘R2’: ‘Rsquared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that is explained by an independent variable or variables in a regression model.’,
‘SMAPE’: ‘Symmetric mean absolute percentage error. It is an accuracy measure based on percentage (or relative) errors.’
Parameters: metrics (str(s), optional) – Specific type of metrics to view
Examples
>>> m = model.LinearRegression()
>>> m.metrics()
>>> m.metrics('SMAPE', 'Root Mean Squared Error')

plot_predicted_actual
(output_file='', **scatterplot_kwargs)¶ Plots the actual data vs. predictions
Parameters: output_file (str, optional) – Output file name, by default “”

r2
(**kwargs)¶ R^2 (coefficient of determination) regression score function.
Rsquared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that is explained by an independent variable or variables in a regression model.
Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
Returns: R2 coefficient.
Return type: float
Examples
>>> m = model.LinearRegression()
>>> m.r2()
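The definition R2 = 1 - SS_res / SS_tot, including the claim that a mean-predicting constant model scores 0.0, can be checked in plain Python (the data points are made up):

```python
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mean = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # 0.5
ss_tot = sum((t - mean) ** 2 for t in y_true)                # 20.0
r2 = 1 - ss_res / ss_tot                                     # 0.975

# A constant model that always predicts the mean has ss_res == ss_tot, so R2 == 0.
constant_r2 = 1 - sum((t - mean) ** 2 for t in y_true) / ss_tot
```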

root_mean_sq_error
()¶ Root mean squared error.
Calculated by taking the square root of the Mean Squared Error.
Returns: Root mean squared error.
Return type: float
Examples
>>> m = model.LinearRegression()
>>> m.root_mean_sq_error()

smape
(**kwargs)¶ Symmetric mean absolute percentage error.
It is an accuracy measure based on percentage (or relative) errors.
Returns: SMAPE
Return type: float
Examples
>>> m = model.LinearRegression()
>>> m.smape()
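SMAPE has several variants in the literature; one common definition (this particular form, and the data, are illustrative assumptions rather than aethos's exact implementation) is sketched below:

```python
# SMAPE as a percentage: mean of 2*|pred - true| / (|true| + |pred|), times 100.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

smape = 100 / len(y_true) * sum(
    2 * abs(p - t) / (abs(t) + abs(p))
    for t, p in zip(y_true, y_pred)
)
# Roughly 8.06 percent for this data.
```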


class
aethos.model_analysis.unsupervised_model_analysis.
UnsupervisedModelAnalysis
(model, data, model_name)¶ Bases:
aethos.model_analysis.model_analysis.ModelAnalysisBase

filter_cluster
(cluster_no: int)¶ Filters data by a cluster number for analysis.
Parameters: cluster_no (int) – Cluster number to filter by
Returns: Filtered data or test dataframe
Return type: Dataframe
Examples
>>> m = model.KMeans()
>>> m.filter_cluster(1)

plot_clusters
(dim=2, reduce='pca', output_file='', **kwargs)¶ Plots the clusters in either 2d or 3d space with each cluster point highlighted as a different colour.
For 2d plotting options, see:
For 3d plotting options, see:
Parameters:  dim (2 or 3, optional) – Dimension of the plot, either 2 for 2d, 3 for 3d, by default 2
 reduce (str {'pca', 'tsvd', 'lle', 'tsne'}, optional) – Dimension reduction strategy i.e. pca, by default “pca”
 output_file (str) – Output file name including extension (.png, .jpg, etc.) to save image as.
Examples
>>> m = model.KMeans()
>>> m.plot_clusters()
>>> m.plot_clusters(dim=3)


class
aethos.model_analysis.text_model_analysis.
TextModelAnalysis
(model, data, model_name, **kwargs)¶ Bases:
aethos.model_analysis.model_analysis.ModelAnalysisBase

coherence_score
(col_name)¶ Displays the coherence score of the topic model.
For more info on topic coherence: https://rare-technologies.com/what-is-topic-coherence/
Parameters: col_name (str) – Column name that was used as input for the LDA model
Examples
>>> m = model.LDA()
>>> m.coherence_score()

model_perplexity
()¶ Displays the model perplexity of the topic model.
Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models.
A low perplexity indicates the probability distribution is good at predicting the sample.
Examples
>>> m = model.LDA()
>>> m.model_perplexity()

view
(original_text, model_output)¶ View the original text and the model output in a more user-friendly format.
Parameters:  original_text (str) – Column name of the original text
 model_output (str) – Column name of the model text
Examples
>>> m = model.LDA()
>>> m.view('original_text_col_name', 'model_output_col_name')

view_topic
(topic_num: int, **kwargs)¶ View a specific topic from topic modelling model.
Parameters: topic_num (int) – Topic number to view
Returns: String representation of topic and probabilities
Return type: str
Examples
>>> m = model.LDA()
>>> m.view_topic(1)

view_topics
(num_topics=10, **kwargs)¶ View topics from topic modelling model.
Parameters: num_topics (int, optional) – Number of topics to view, by default 10
Returns: String representation of topics and probabilities
Return type: str
Examples
>>> m = model.LDA()
>>> m.view_topics()

visualize_topics
(**kwargs)¶ Visualize topics using pyLDAvis.
Parameters:  R (int) – The number of terms to display in the barcharts of the visualization. Default is 30. Recommended to be roughly between 10 and 50.
 lambda_step (float, between 0 and 1) – Determines the inter-step distance in the grid of lambda values over which to iterate when computing relevance. Default is 0.01. Recommended to be between 0.01 and 0.1.
 mds (function or {'pcoa', 'mmds', 'tsne'}) – A function that takes topic_term_dists as an input and outputs a n_topics by 2 distance matrix. The output approximates the distance between topics. See js_PCoA() for details on the default function. A string representation currently accepts pcoa (or upper case variant), mmds (or upper case variant) and tsne (or upper case variant), if the sklearn package is installed for the latter two.
 n_jobs (int) – The number of cores to be used to do the computations. The regular joblib conventions are followed, so -1, which is the default, will use all cores.
 plot_opts (dict, with keys ‘xlab’ and ‘ylab’) – Dictionary of plotting options, right now only used for the axis labels.
 sort_topics (bool) – Sort topics by topic proportion (percentage of tokens covered). Set to false to keep original topic order.
Examples
>>> m = model.LDA()
>>> m.visualize_topics()
