Stats package

Module contents

class aethos.stats.stats.Stats

Bases: object

anova(dep_var: str, num_variables=[], cat_variables=[], formula=None, verbose=False)

Runs an ANOVA.

ANOVA is used to compare the mean of a dependent variable across two or more groups.

It tests whether the means differ somewhere in the model (that is, whether there is an overall effect), but it does not identify where the difference lies if there is one.

Parameters:
  • dep_var (str) – Dependent variable you want to explore the relationship of
  • num_variables (list, optional) – Numeric variable columns, by default []
  • cat_variables (list, optional) – Categorical variable columns, by default []
  • formula (str, optional) – OLS formula in statsmodels formula syntax, by default None
  • verbose (bool, optional) – True to print OLS model summary and formula, by default False

Examples

>>> data.anova('dep_col', num_variables=['col1', 'col2'], verbose=True)
>>> data.anova('dep_col', cat_variables=['col1', 'col2'], verbose=True)
>>> data.anova('dep_col', num_variables=['col1', 'col2'], cat_variables=['col3'], verbose=True)
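
To make the overall-effect test concrete, the following is a minimal sketch of the one-way ANOVA F statistic in plain Python. It is an illustration only and does not claim to mirror how this method computes its result; the helper one_way_anova_f and the sample groups are hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical data: the third group sits well above the other two.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
```

A large F value indicates that at least one group mean differs, but it does not say which one; a post-hoc comparison would be needed for that.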

ind_ttest(group1: str, group2: str, equal_var=True, output_file=None)

Performs an Independent T test.

This is to be used when you want to compare the means of 2 groups.

If group 2 column name is not provided and there is a test set, it will compare the same column in the train and test set.

Any NaN values are omitted.

Parameters:
  • group1 (str) – Column for group 1 to compare.
  • group2 (str, optional) – Column for group 2 to compare, by default None
  • equal_var (bool, optional) – If True, perform a standard independent two-sample test that assumes equal population variances. If False, perform Welch’s t-test, which does not assume equal population variances. By default True
  • output_file (str, optional) – Name of the file to output, by default None
Returns:

T test statistic, P value

Return type:

list

Examples

>>> data.ind_ttest('col1', 'col2')
>>> data.ind_ttest('col1', 'col2', output_file='ind_ttest.png')
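
The effect of equal_var can be seen in a small standalone sketch of the two-sample t statistic. This is not the library's implementation; the helper t_statistic and the sample values are hypothetical, and the p-value is left out because it requires the t distribution's CDF.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(a, b, equal_var=True):
    """Two-sample t statistic: pooled variance if equal_var, otherwise Welch."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    if equal_var:
        # Pool both variances into one estimate, weighted by degrees of freedom.
        pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
        se = sqrt(pooled * (1 / na + 1 / nb))
    else:
        # Welch: keep each group's variance separate.
        se = sqrt(va / na + vb / nb)
    return (mean(a) - mean(b)) / se

# Hypothetical groups with clearly different variances.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10, 12]
t_pooled = t_statistic(a, b)
t_welch = t_statistic(a, b, equal_var=False)
```

When the group variances or sizes differ, the two statistics diverge, which is why Welch's variant is the safer choice if equal variances cannot be assumed.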

ks_feature_distribution(threshold=0.1, show_plots=True)

Uses the Kolmogorov-Smirnov test to see whether the distributions in the training and test sets are similar.

Credit: https://www.kaggle.com/nanomathias/distribution-of-test-vs-training-data#1.-t-SNE-Distribution-Overview

Parameters:
  • threshold (float, optional) – KS statistic threshold, by default 0.1
  • show_plots (bool, optional) – True to show histograms of feature distributions, by default True
Returns:

Columns that are significantly different in the train and test set.

Return type:

DataFrame

Examples

>>> data.ks_feature_distribution()
>>> data.ks_feature_distribution(threshold=0.2)
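
For intuition about what the KS statistic measures, here is a minimal plain-Python sketch of the two-sample statistic: the largest gap between the two empirical CDFs. The helper ks_statistic and the sample columns are hypothetical, and how this method combines the statistic with a p-value is not covered here.

```python
from bisect import bisect_right

def ks_statistic(train, test):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    train_sorted, test_sorted = sorted(train), sorted(test)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in train_sorted + test_sorted:
        cdf_train = bisect_right(train_sorted, x) / len(train_sorted)
        cdf_test = bisect_right(test_sorted, x) / len(test_sorted)
        gap = max(gap, abs(cdf_train - cdf_test))
    return gap

# Hypothetical feature columns.
train_col = [1, 2, 3, 4, 5]
test_similar = [1, 2, 3, 4, 6]
test_shifted = [11, 12, 13, 14, 15]

ks_similar = ks_statistic(train_col, test_similar)
ks_shifted = ks_statistic(train_col, test_shifted)
```

A statistic near 0 means the two samples look alike, while a value near 1 means they barely overlap; a threshold such as 0.1 draws the line between the two cases.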

most_common(col: str, n=15, plot=False, use_test=False, output_file='', **plot_kwargs)

Analyzes the most common values in the column and either prints them or displays a bar chart.

Parameters:
  • col (str) – Column to analyze
  • n (int, optional) – Number of top most common values to display, by default 15
  • plot (bool, optional) – True to plot a bar chart, by default False
  • use_test (bool, optional) – True to analyze the test set, by default False
  • output_file (str, optional) – File name to save the plot as, used only if plot=True, by default ''

Examples

>>> data.most_common('col1', plot=True)
>>> data.most_common('col1', n=50, plot=True)
>>> data.most_common('col1', n=50)
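
The underlying idea is a frequency count limited to the top n values. A minimal standard-library sketch, assuming a plain list of values rather than a DataFrame column (most_common_values and col1 are hypothetical):

```python
from collections import Counter

def most_common_values(values, n=15):
    """Return the n most frequent values with their counts, most frequent first."""
    return Counter(values).most_common(n)

# Hypothetical column contents.
col1 = ["a", "b", "a", "c", "a", "b"]
top = most_common_values(col1, n=2)
```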

onesample_ttest(group1: str, mean: Union[float, int], output_file=None)

Performs a One Sample t-test.

This is to be used when you want to compare the mean of a single group against a known mean.

Any NaN values are omitted.

Parameters:
  • group1 (str) – Column for group 1 to compare.
  • mean (float or int) – Known mean to compare the group mean against.
  • output_file (str, optional) – Name of the file to output, by default None
Returns:

T test statistic, P value

Return type:

list

Examples

>>> data.onesample_ttest('col1', 1)
>>> data.onesample_ttest('col1', 1, output_file='ones_ttest.png')
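
As a rough illustration of the statistic behind this test, the sketch below computes a one-sample t value in plain Python, dropping NaN values first as described above. The helper one_sample_t and the data are hypothetical, and the p-value is not computed.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(values, known_mean):
    """t statistic for the sample mean against a known mean, ignoring NaNs."""
    clean = [v for v in values if v == v]  # NaN is never equal to itself
    standard_error = stdev(clean) / sqrt(len(clean))
    return (mean(clean) - known_mean) / standard_error

# Hypothetical column with one missing value.
col1 = [2.0, 4.0, float("nan"), 6.0, 8.0]
t = one_sample_t(col1, 4.0)
```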

paired_ttest(group1: str, group2=None, output_file=None)

Performs a Paired t-test.

This is to be used when you want to compare the means from the same group at different times.

If group 2 column name is not provided and there is a test set, it will compare the same column in the train and test set.

Any NaN values are omitted.

Parameters:
  • group1 (str) – Column for group 1 to compare.
  • group2 (str, optional) – Column for group 2 to compare, by default None
  • output_file (str, optional) – Name of the file to output, by default None
Returns:

T test statistic, P value

Return type:

list

Examples

>>> data.paired_ttest('col1', 'col2')
>>> data.paired_ttest('col1', 'col2', output_file='pair_ttest.png')
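
A paired t-test is equivalent to a one-sample t-test on the per-row differences against zero. The following plain-Python sketch shows that calculation; paired_t and the before/after values are hypothetical, and the p-value is omitted.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first, second):
    """Paired t statistic: one-sample t on the row-wise differences."""
    diffs = [a - b for a, b in zip(first, second)]
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs) / standard_error

# Hypothetical measurements of the same subjects at two points in time.
before = [10.0, 12.0, 9.0, 11.0]
after = [12.0, 15.0, 10.0, 14.0]
t = paired_t(before, after)
```

Because each row is compared with itself, the test depends on the spread of the differences rather than the spread of each column, which is what distinguishes it from the independent test.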

predict_data_sample()

Measures how similar the train and test set distributions are by trying to predict whether each sample belongs to the train or test set, using a Random Forest with 10-fold stratified cross-validation.

The lower the F1 score, the more similar the distributions are, because it is harder to predict which set a sample belongs to.

Credit: https://www.kaggle.com/nanomathias/distribution-of-test-vs-training-data#1.-t-SNE-Distribution-Overview

Returns:

A deep copy of the Data object.

Return type:

Data

Examples

>>> data.predict_data_sample()
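
The idea can be sketched with the standard library alone. Note the substitutions: this sketch uses a nearest-centroid classifier on a single feature and plain shuffled 5-fold cross-validation in place of the Random Forest and 10-fold stratified cross-validation described above. The helper f1_train_vs_test and the generated data are hypothetical.

```python
import random
from statistics import mean

def f1_train_vs_test(train, test, folds=5, seed=0):
    """Cross-validated F1 for telling train values (label 0) from test values (label 1)."""
    rows = [(x, 0) for x in train] + [(x, 1) for x in test]
    random.Random(seed).shuffle(rows)
    tp = fp = fn = 0
    for k in range(folds):
        held_out = rows[k::folds]
        fit = [r for i, r in enumerate(rows) if i % folds != k]
        # Nearest-centroid stand-in for the Random Forest named in the text.
        c0 = mean(x for x, y in fit if y == 0)
        c1 = mean(x for x, y in fit if y == 1)
        for x, y in held_out:
            pred = 1 if abs(x - c1) < abs(x - c0) else 0
            tp += int(pred == 1 and y == 1)
            fp += int(pred == 1 and y == 0)
            fn += int(pred == 0 and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical single-feature data sets.
rng = random.Random(1)
train = [rng.gauss(0, 1) for _ in range(200)]
test_similar = [rng.gauss(0, 1) for _ in range(200)]
test_shifted = [rng.gauss(3, 1) for _ in range(200)]

f1_similar = f1_train_vs_test(train, test_similar)
f1_shifted = f1_train_vs_test(train, test_shifted)
```

When the test set is drawn from the same distribution, the classifier can do little better than guess, so its F1 score stays well below the score for the shifted set.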