Preprocessing package

Module contents

class aethos.preprocessing.Preprocess

Bases: object

clean_text(*list_args, list_of_cols=[], lower=True, punctuation=True, stopwords=True, stemmer=True, numbers=True, new_col_name='_clean')

Function that takes text and does the following:

  • Casts it to lowercase
  • Removes punctuation
  • Removes stopwords
  • Stems the text
  • Removes any numerical text
Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • lower (bool, optional) – True to cast all text to lowercase, by default True
  • punctuation (bool, optional) – True to remove punctuation, by default True
  • stopwords (bool, optional) – True to remove stop words, by default True
  • stemmer (bool, optional) – True to stem the data, by default True
  • numbers (bool, optional) – True to remove numerical data, by default True
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_clean
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.clean_text('col1')
>>> data.clean_text(['col1', 'col2'], lower=False)
>>> data.clean_text(lower=False, stopwords=False, stemmer=False)
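
For reference, a minimal sketch of roughly equivalent steps using NLTK and the standard library directly (illustrative only; the helper name and the order of operations here are assumptions, not the library's actual implementation):

>>> import re
>>> from nltk.corpus import stopwords   # requires nltk.download('stopwords')
>>> from nltk.stem import PorterStemmer
>>> stop = set(stopwords.words('english'))
>>> stemmer = PorterStemmer()
>>> def clean(text):
...     text = text.lower()                                   # lowercase
...     text = re.sub(r'[^\w\s]', '', text)                   # remove punctuation
...     text = re.sub(r'\d+', '', text)                       # remove numbers
...     words = [w for w in text.split() if w not in stop]    # remove stopwords
...     return ' '.join(stemmer.stem(w) for w in words)       # stem
>>> clean("The 3 dogs were running!")   # -> 'dog run'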
normalize_log(*list_args, list_of_cols=[], base=1)

Scales data logarithmically.

Options are 1 for natural log, 2 for base 2, and 10 for base 10.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • base (int, optional) – Base to logarithmically scale by: 1 for natural log, 2 for base 2, 10 for base 10; by default 1
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.normalize_log('col1')
>>> data.normalize_log(['col1', 'col2'], base=10)
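
The base argument corresponds to the usual NumPy logarithm functions; a quick illustration of what each option computes (assuming the scaling is a plain element-wise logarithm):

>>> import numpy as np
>>> np.log([1, np.e, np.e ** 2])    # base=1 -> natural log
>>> np.log2([1, 2, 8])              # base=2
>>> np.log10([1, 10, 100])          # base=10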
normalize_numeric(*list_args, list_of_cols=[], **normalize_params)

Function that normalizes all numeric values between two values to bring features into the same domain.

If list_of_cols is not provided, the strategy will be applied to all numeric columns.

If a list of columns is provided, use the list; otherwise use the column arguments.

For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • feature_range (tuple(int or float, int or float), optional) – Min and max range to normalize values to, by default (0, 1)
  • normalize_params (dict, optional) – Parameters to pass into the MinMaxScaler() constructor from Scikit-Learn
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.normalize_numeric('col1')
>>> data.normalize_numeric(['col1', 'col2'])
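
Keyword arguments in normalize_params are forwarded to scikit-learn's MinMaxScaler, so a custom range can be passed directly; for example (a sketch, assuming keyword forwarding works as documented above):

>>> data.normalize_numeric('col1', feature_range=(-1, 1))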
normalize_quantile_range(*list_args, list_of_cols=[], **robust_params)

Scale features using statistics that are robust to outliers.

This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.

If list_of_cols is not provided, the strategy will be applied to all numeric columns.

If a list of columns is provided, use the list; otherwise use the column arguments.

For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • with_centering (boolean, True by default) – If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
  • with_scaling (boolean, True by default) – If True, scale the data to interquartile range.
  • quantile_range (tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0) – Default: (25.0, 75.0) = (1st quantile, 3rd quantile) = IQR Quantile range used to calculate scale_.
  • robust_params (dict, optional) – Parameters to pass into the RobustScaler() constructor from Scikit-Learn
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.normalize_quantile_range('col1')
>>> data.normalize_quantile_range(['col1', 'col2'])
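
Likewise, keyword arguments in robust_params are forwarded to scikit-learn's RobustScaler, so a different quantile range or centering behaviour can be supplied; for example (a sketch, assuming keyword forwarding as documented above):

>>> data.normalize_quantile_range('col1', quantile_range=(10.0, 90.0))
>>> data.normalize_quantile_range('col1', with_centering=False)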
remove_numbers(*list_args, list_of_cols=[], new_col_name='_rem_num')

Removes numbers from text in a column.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_num
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.remove_numbers('col1', new_col_name="text_wo_num")
remove_punctuation(*list_args, list_of_cols=[], regexp='', exceptions=[], new_col_name='_rem_punct')

Removes punctuation from every string entry.

Defaults to removing all punctuation; if a regex is provided, it defines what text to keep.

An example regex would be:

(\w+\.|\w+)[^,] - Include all words and words with periods after them but don’t include commas. (\w+\.)|(\w+), would also achieve the same result

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • regexp (str, optional) – Regex expression used to define what to include.
  • exceptions (list, optional) – List of punctuation to include in the text, by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_punct
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.remove_punctuation('col1')
>>> data.remove_punctuation(['col1', 'col2'])
>>> data.remove_punctuation('col1', regexp=r'(\w+\.)|(\w+)') # Include all words and words with periods after.
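
Specific punctuation characters can also be kept via the exceptions list; for example (a short sketch based on the parameter described above):

>>> data.remove_punctuation('col1', exceptions=['.', '!'])   # keep periods and exclamation marks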
remove_stopwords_nltk(*list_args, list_of_cols=[], custom_stopwords=[], new_col_name='_rem_stop')

Removes stopwords following the NLTK English stopwords list.

A list of custom words can be provided as well, usually for domain specific words.

Stop words are generally the most common words in a language.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • custom_stopwords (list, optional) – Custom list of words to also drop with the stop words, must be LOWERCASE, by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_stop
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.remove_stopwords_nltk('col1')
>>> data.remove_stopwords_nltk(['col1', 'col2'])
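
Domain-specific words can be dropped alongside the NLTK list by passing custom_stopwords (the words must be lowercase); for example, with hypothetical words:

>>> data.remove_stopwords_nltk('col1', custom_stopwords=['foo', 'bar'])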
split_sentences(*list_args, list_of_cols=[], new_col_name='_sentences')

Splits text data into sentences and saves it into another column for analysis.

If a list of columns is provided, use the list; otherwise use the column arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_sentences
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.split_sentences('col1')
>>> data.split_sentences(['col1', 'col2'])
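
Splitting a single string into sentences can be illustrated with NLTK's sent_tokenize (an assumption about the underlying tokenizer; the library may use a different one):

>>> from nltk.tokenize import sent_tokenize   # requires nltk.download('punkt')
>>> sent_tokenize("First sentence. Second sentence!")   # -> ['First sentence.', 'Second sentence!']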
split_words_nltk(*list_args, list_of_cols=[], regexp='', new_col_name='_tokenized')

Splits text into its words, using the NLTK Punkt tokenizer by default.

By default it splits on spaces and punctuation, but if a regex expression is provided, that is used instead.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • regexp (str, optional) – Regex expression used to define what a word is.
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_tokenized
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.split_words_nltk('col1')
>>> data.split_words_nltk(['col1', 'col2'])
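
A regular expression can be supplied to control what counts as a word; for example, tokenizing on alphanumeric runs only (a sketch, assuming the regexp is applied as a token pattern):

>>> data.split_words_nltk('col1', regexp=r'\w+')   # keep alphanumeric tokens only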
stem_nltk(*list_args, list_of_cols=[], stemmer='porter', new_col_name='_stemmed')

Transforms words to their stem, base, or root form. For example:

  • dogs –> dog
  • churches –> church
  • abaci –> abacus
Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • stemmer (str, optional) –

    Type of NLTK stemmer to use, by default porter

    Current stemming implementations:
    • porter
    • snowball

    For more information please refer to the NLTK stemming api https://www.nltk.org/api/nltk.stem.html

  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_stemmed
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.stem_nltk('col1')
>>> data.stem_nltk(['col1', 'col2'], stemmer='snowball')
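
For a standalone feel of the two stemmers, NLTK can be called directly (illustrative only):

>>> from nltk.stem import PorterStemmer, SnowballStemmer
>>> PorterStemmer().stem('churches')              # -> 'church'
>>> SnowballStemmer('english').stem('running')    # -> 'run'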