Preprocessing package

Module contents

class aethos.preprocessing.Preprocess

Bases: object

clean_text(*list_args, list_of_cols=[], lower=True, punctuation=True, stopwords=True, stemmer=True, numbers=True, new_col_name='_clean')

Function that takes text and does the following:

  • Casts it to lowercase
  • Removes punctuation
  • Removes stopwords
  • Stems the text
  • Removes any numerical text
Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • lower (bool, optional) – True to cast all text to lowercase, by default True
  • punctuation (bool, optional) – True to remove punctuation, by default True
  • stopwords (bool, optional) – True to remove stop words, by default True
  • stemmer (bool, optional) – True to stem the data, by default True
  • numbers (bool, optional) – True to remove numerical data, by default True
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_clean
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.clean_text('col1')
>>> data.clean_text(['col1', 'col2'], lower=False)
>>> data.clean_text(lower=False, stopwords=False, stemmer=False)
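
For reference, a minimal sketch of roughly equivalent steps using NLTK and the standard library directly (illustrative only; the helper name and the order of operations here are assumptions, not the library's actual implementation):

>>> import re
>>> from nltk.corpus import stopwords   # requires nltk.download('stopwords')
>>> from nltk.stem import PorterStemmer
>>> stop = set(stopwords.words('english'))
>>> stemmer = PorterStemmer()
>>> def clean(text):
...     text = text.lower()                                   # lowercase
...     text = re.sub(r'[^\w\s]', '', text)                   # remove punctuation
...     text = re.sub(r'\d+', '', text)                       # remove numbers
...     words = [w for w in text.split() if w not in stop]    # remove stopwords
...     return ' '.join(stemmer.stem(w) for w in words)       # stem
>>> clean("The 3 dogs were running!")   # -> 'dog run'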
normalize_log(*list_args, list_of_cols=[], base=1)

Scales data logarithmically.

Options are 1 for natural log, 2 for base 2, and 10 for base 10.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • base (int, optional) – Base to logarithmically scale by: 1 for natural log, 2 for base 2, 10 for base 10; by default 1
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.normalize_log('col1')
>>> data.normalize_log(['col1', 'col2'], base=10)
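
The base argument corresponds to the usual NumPy logarithm functions; a quick illustration of what each option computes (assuming the scaling is a plain element-wise logarithm):

>>> import numpy as np
>>> np.log([1, np.e, np.e ** 2])    # base=1 -> natural log
>>> np.log2([1, 2, 8])              # base=2
>>> np.log10([1, 10, 100])          # base=10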
normalize_numeric(*list_args, list_of_cols=[], **normalize_params)

Function that normalizes all numeric values between two values to bring features into the same domain.

If list_of_cols is not provided, the strategy will be applied to all numeric columns.

If a list of columns is provided, use the list; otherwise use the column arguments.

For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • feature_range (tuple(int or float, int or float), optional) – Min and max range to normalize values to, by default (0, 1)
  • normalize_params (dict, optional) – Parameters to pass into the MinMaxScaler() constructor from Scikit-Learn
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.normalize_numeric('col1')
>>> data.normalize_numeric(['col1', 'col2'])
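
Keyword arguments in normalize_params are forwarded to scikit-learn's MinMaxScaler, so a custom range can be passed directly; for example (a sketch, assuming keyword forwarding works as documented above):

>>> data.normalize_numeric('col1', feature_range=(-1, 1))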
normalize_quantile_range(*list_args, list_of_cols=[], **robust_params)

Scale features using statistics that are robust to outliers.

This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.

If list_of_cols is not provided, the strategy will be applied to all numeric columns.

If a list of columns is provided, use the list; otherwise use the column arguments.

For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • with_centering (boolean, True by default) – If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
  • with_scaling (boolean, True by default) – If True, scale the data to interquartile range.
  • quantile_range (tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0) – Default: (25.0, 75.0) = (1st quantile, 3rd quantile) = IQR Quantile range used to calculate scale_.
  • robust_params (dict, optional) – Parameters to pass into the RobustScaler() constructor from Scikit-Learn
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.normalize_quantile_range('col1')
>>> data.normalize_quantile_range(['col1', 'col2'])
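
Likewise, keyword arguments in robust_params are forwarded to scikit-learn's RobustScaler, so a different quantile range or centering behaviour can be supplied; for example (a sketch, assuming keyword forwarding as documented above):

>>> data.normalize_quantile_range('col1', quantile_range=(10.0, 90.0))
>>> data.normalize_quantile_range('col1', with_centering=False)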
remove_numbers(*list_args, list_of_cols=[], new_col_name='_rem_num')

Removes numbers from text in a column.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_num
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.remove_numbers('col1', new_col_name="text_wo_num")
remove_punctuation(*list_args, list_of_cols=[], regexp='', exceptions=[], new_col_name='_rem_punct')

Removes punctuation from every string entry.

Defaults to removing all punctuation; if a regex is provided, it defines what text to keep.

An example regex would be:

(\w+\.|\w+)[^,] - Include all words and words with periods after them but don’t include commas. (\w+\.)|(\w+), would also achieve the same result

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • regexp (str, optional) – Regex expression used to define what to include.
  • exceptions (list, optional) – List of punctuation to include in the text, by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_punct
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.remove_punctuation('col1')
>>> data.remove_punctuation(['col1', 'col2'])
>>> data.remove_punctuation('col1', regexp=r'(\w+\.)|(\w+)') # Include all words and words with periods after.
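
Specific punctuation characters can also be kept via the exceptions list; for example (a short sketch based on the parameter described above):

>>> data.remove_punctuation('col1', exceptions=['.', '!'])   # keep periods and exclamation marks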
remove_stopwords_nltk(*list_args, list_of_cols=[], custom_stopwords=[], new_col_name='_rem_stop')

Removes stopwords following the NLTK English stopwords list.

A list of custom words can be provided as well, usually for domain specific words.

Stop words are generally the most common words in a language.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • custom_stopwords (list, optional) – Custom list of words to also drop with the stop words, must be LOWERCASE, by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_stop
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.remove_stopwords_nltk('col1')
>>> data.remove_stopwords_nltk(['col1', 'col2'])
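
Domain-specific words can be dropped alongside the NLTK list by passing custom_stopwords (the words must be lowercase); for example, with hypothetical words:

>>> data.remove_stopwords_nltk('col1', custom_stopwords=['foo', 'bar'])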
split_sentences(*list_args, list_of_cols=[], new_col_name='_sentences')

Splits text data into sentences and saves it into another column for analysis.

If a list of columns is provided, use the list; otherwise use the column arguments.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_sentences
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.split_sentences('col1')
>>> data.split_sentences(['col1', 'col2'])
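
Splitting a single string into sentences can be illustrated with NLTK's sent_tokenize (an assumption about the underlying tokenizer; the library may use a different one):

>>> from nltk.tokenize import sent_tokenize   # requires nltk.download('punkt')
>>> sent_tokenize("First sentence. Second sentence!")   # -> ['First sentence.', 'Second sentence!']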
split_words_nltk(*list_args, list_of_cols=[], regexp='', new_col_name='_tokenized')

Splits text into its words, using the NLTK Punkt tokenizer by default.

By default it splits on spaces and punctuation, but if a regex expression is provided, that is used instead.

Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • regexp (str, optional) – Regex expression used to define what a word is.
  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_tokenized
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.split_words_nltk('col1')
>>> data.split_words_nltk(['col1', 'col2'])
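
A regular expression can be supplied to control what counts as a word; for example, tokenizing on alphanumeric runs only (a sketch, assuming the regexp is applied as a token pattern):

>>> data.split_words_nltk('col1', regexp=r'\w+')   # keep alphanumeric tokens only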
stem_nltk(*list_args, list_of_cols=[], stemmer='porter', new_col_name='_stemmed')

Transforms words to their stem, base, or root form. For example:

  • dogs –> dog
  • churches –> church
  • abaci –> abacus
Parameters:
  • list_args (str(s), optional) – Specific columns to apply this technique to.
  • list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
  • stemmer (str, optional) –

    Type of NLTK stemmer to use, by default porter

    Current stemming implementations:
    • porter
    • snowball

    For more information please refer to the NLTK stemming api https://www.nltk.org/api/nltk.stem.html

  • new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_stemmed
Returns:

Returns a deep copy of the Data object.

Return type:

Data

Examples

>>> data.stem_nltk('col1')
>>> data.stem_nltk(['col1', 'col2'], stemmer='snowball')
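
For a standalone feel of the two stemmers, NLTK can be called directly (illustrative only):

>>> from nltk.stem import PorterStemmer, SnowballStemmer
>>> PorterStemmer().stem('churches')              # -> 'church'
>>> SnowballStemmer('english').stem('running')    # -> 'run'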