Preprocessing package¶
Module contents¶
class aethos.preprocessing.Preprocess¶
Bases: object
clean_text(*list_args, list_of_cols=[], lower=True, punctuation=True, stopwords=True, stemmer=True, numbers=True, new_col_name='_clean')¶
Function that takes text and does the following:
- Casts it to lowercase
- Removes punctuation
- Removes stopwords
- Stems the text
- Removes any numerical text
Parameters: - list_args (str(s), optional) – Specific columns to apply this technique to.
- list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
- lower (bool, optional) – True to cast all text to lowercase, by default True
- punctuation (bool, optional) – True to remove punctuation, by default True
- stopwords (bool, optional) – True to remove stop words, by default True
- stemmer (bool, optional) – True to stem the data, by default True
- numbers (bool, optional) – True to remove numerical data, by default True
- new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_clean
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.clean_text('col1')
>>> data.clean_text(['col1', 'col2'], lower=False)
>>> data.clean_text(lower=False, stopwords=False, stemmer=False)
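The steps above can be sketched with the standard library alone. This is a simplified stand-in, not the library's implementation: the tiny stopword set and naive suffix stemmer approximate the NLTK resources the real method would use.

```python
import re
import string

# Toy stopword set; the real method would use NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "to"}

def naive_stem(word):
    # Crude approximation of a stemmer: strip a few common suffixes.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_text(text, lower=True, punctuation=True, stopwords=True,
               stemmer=True, numbers=True):
    # Apply each cleaning step only when its flag is set, mirroring the
    # keyword arguments documented above.
    if lower:
        text = text.lower()
    if punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    if numbers:
        text = re.sub(r"\d+", "", text)
    words = text.split()
    if stopwords:
        words = [w for w in words if w not in STOPWORDS]
    if stemmer:
        words = [naive_stem(w) for w in words]
    return " ".join(words)

print(clean_text("The 3 dogs are barking!"))  # -> "dog bark"
```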
normalize_log(*list_args, list_of_cols=[], base=1)¶
Scales data logarithmically.
Options are 1 for natural log, 2 for base2, 10 for base10.
Parameters: - list_args (str(s), optional) – Specific columns to apply this technique to.
- list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
- base (int, optional) – Base to logarithmically scale by, by default 1 (natural log)
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.normalize_log('col1')
>>> data.normalize_log(['col1', 'col2'], base=10)
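The base handling described above can be sketched as follows; this is an illustrative stand-in, assuming the documented convention that base 1 means the natural log:

```python
import math

def normalize_log(values, base=1):
    # Base 1 is treated as the natural log; any other base is passed
    # straight to math.log.
    if base == 1:
        return [math.log(v) for v in values]
    return [math.log(v, base) for v in values]

print(normalize_log([1, 10, 100], base=10))  # roughly [0.0, 1.0, 2.0]
```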
normalize_numeric(*list_args, list_of_cols=[], **normalize_params)¶
Normalizes all numeric values into a given range to bring features into the same domain.
If list_of_cols is not provided, the strategy will be applied to all numeric columns.
If a list of columns is provided, the list is used; otherwise the positional arguments are used.
For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler
Parameters: - list_args (str(s), optional) – Specific columns to apply this technique to.
- list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
- feature_range (tuple(int or float, int or float), optional) – Min and max range to normalize values to, by default (0, 1)
- normalize_params (dict, optional) – Parameters to pass into the MinMaxScaler() constructor from Scikit-Learn
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.normalize_numeric('col1')
>>> data.normalize_numeric(['col1', 'col2'])
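The transformation itself is the same min-max scaling MinMaxScaler applies: each value is mapped into feature_range using the column minimum and maximum. A minimal sketch:

```python
def min_max_scale(values, feature_range=(0, 1)):
    # x' = lo + (x - min) * (hi - lo) / (max - min)
    lo, hi = feature_range
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

print(min_max_scale([10, 20, 30]))           # -> [0.0, 0.5, 1.0]
print(min_max_scale([10, 20, 30], (0, 10)))  # -> [0.0, 5.0, 10.0]
```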
normalize_quantile_range(*list_args, list_of_cols=[], **robust_params)¶
Scale features using statistics that are robust to outliers.
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.
If list_of_cols is not provided, the strategy will be applied to all numeric columns.
If a list of columns is provided, the list is used; otherwise the positional arguments are used.
For more info please see: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler
Parameters: - list_args (str(s), optional) – Specific columns to apply this technique to.
- list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
- with_centering (boolean, True by default) – If True, center the data before scaling. This will cause transform to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
- with_scaling (boolean, True by default) – If True, scale the data to interquartile range.
- quantile_range (tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0) – Default: (25.0, 75.0) = (1st quantile, 3rd quantile) = IQR Quantile range used to calculate scale_.
- robust_params (dict, optional) – Parameters to pass into the RobustScaler() constructor from Scikit-Learn
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.normalize_quantile_range('col1')
>>> data.normalize_quantile_range(['col1', 'col2'])
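The core of robust scaling is: subtract the median, then divide by the interquartile range. A rough stdlib sketch (the quartile computation here is a simple median-split approximation; Scikit-Learn's RobustScaler uses its own interpolation):

```python
import statistics

def robust_scale(values):
    med = statistics.median(values)
    ordered = sorted(values)
    # Approximate Q1/Q3 as the medians of the lower and upper halves.
    q1 = statistics.median(ordered[: len(ordered) // 2])
    q3 = statistics.median(ordered[(len(ordered) + 1) // 2:])
    iqr = q3 - q1
    return [(v - med) / iqr for v in values]

# The outlier 100 barely inflates the scale, because the median and IQR
# ignore it; the median itself maps to exactly 0.
print(robust_scale([1, 2, 3, 4, 100]))
```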
remove_numbers(*list_args, list_of_cols=[], new_col_name='_rem_num')¶
Removes numbers from text in a column.
Parameters: - list_args (str(s), optional) – Specific columns to apply this technique to.
- list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
- new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_num
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.remove_numbers('col1', new_col_name="text_wo_num")
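The underlying operation is a one-line regex substitution; a minimal sketch:

```python
import re

def remove_numbers(text):
    # Strip every run of digit characters from the text.
    return re.sub(r"\d+", "", text)

print(remove_numbers("Flight 370 departs at 9am"))
```

Note that only the digits are removed; surrounding whitespace is left as-is, so double spaces can remain.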
remove_punctuation(*list_args, list_of_cols=[], regexp='', exceptions=[], new_col_name='_rem_punct')¶
Removes punctuation from every string entry.
Defaults to removing all punctuation, but if a regex is provided, only the text it matches is kept.
An example regex would be:
(\w+\.|\w+)[^,] - Include all words and words with periods after them, but don't include commas. (\w+\.)|(\w+) would achieve the same result.
Parameters: - list_args (str(s), optional) – Specific columns to apply this technique to.
- list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
- regexp (str, optional) – Regex expression used to define what to include.
- exceptions (list, optional) – List of punctuation to include in the text, by default []
- new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_punct
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.remove_punctuation('col1')
>>> data.remove_punctuation(['col1', 'col2'])
>>> data.remove_punctuation('col1', regexp=r'(\w+\.)|(\w+)') # Include all words and words with periods after.
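The exceptions mechanism can be sketched with the standard library: strip everything in string.punctuation except the characters you choose to keep. This is an illustration of the idea, not the library's actual code path:

```python
import string

def remove_punctuation(text, exceptions=()):
    # Build the removal set from string.punctuation, minus any exceptions.
    to_remove = "".join(c for c in string.punctuation if c not in exceptions)
    return text.translate(str.maketrans("", "", to_remove))

print(remove_punctuation("Hello, world! It's fine."))
print(remove_punctuation("Hello, world! It's fine.", exceptions=["'"]))
```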
remove_stopwords_nltk(*list_args, list_of_cols=[], custom_stopwords=[], new_col_name='_rem_stop')¶
Removes stopwords using the NLTK English stopwords list.
A list of custom words can be provided as well, usually for domain specific words.
Stop words are generally the most common words in a language.
Parameters: - list_args (str(s), optional) – Specific columns to apply this technique to.
- list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
- custom_stopwords (list, optional) – Custom list of words to also drop with the stop words, must be LOWERCASE, by default []
- new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_rem_stop
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.remove_stopwords_nltk('col1')
>>> data.remove_stopwords_nltk(['col1', 'col2'])
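The mechanics of combining a base stopword list with custom words can be sketched as below. The toy stopword set stands in for NLTK's much longer English list:

```python
# Stand-in for NLTK's English stopword list.
STOPWORDS = {"the", "is", "a", "an", "and", "of"}

def remove_stopwords(text, custom_stopwords=()):
    # Union the base list with any custom (lowercase) domain words.
    drop = STOPWORDS | set(custom_stopwords)
    return " ".join(w for w in text.split() if w.lower() not in drop)

print(remove_stopwords("the cat and the dog"))                            # -> "cat dog"
print(remove_stopwords("the cat and the dog", custom_stopwords=["cat"]))  # -> "dog"
```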
split_sentences(*list_args, list_of_cols=[], new_col_name='_sentences')¶
Splits text data into sentences and saves them into another column for analysis.
If a list of columns is provided, the list is used; otherwise the positional arguments are used.
Parameters: - list_args (str(s), optional) – Specific columns to apply this technique to.
- list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
- new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_sentences
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.split_sentences('col1')
>>> data.split_sentences(['col1', 'col2'])
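A rough sketch of sentence splitting with a regex. A trained sentence tokenizer such as NLTK's punkt handles abbreviations and other edge cases far better than this toy version:

```python
import re

def split_sentences(text):
    # Split on whitespace that follows sentence-ending punctuation.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(split_sentences("It rained. We stayed in! Did you?"))
```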
split_words_nltk(*list_args, list_of_cols=[], regexp='', new_col_name='_tokenized')¶
Splits text into its words using the NLTK punkt tokenizer by default.
By default it splits on spaces and punctuation, but if a regex expression is provided, that is used instead.
Parameters: - list_args (str(s), optional) – Specific columns to apply this technique to.
- list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
- regexp (str, optional) – Regex expression used to define what a word is.
- new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_tokenized
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.split_words_nltk('col1')
>>> data.split_words_nltk(['col1', 'col2'])
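The role of the regexp parameter can be illustrated with a simple regex tokenizer: the pattern defines what counts as a token. This sketch is not the NLTK tokenizer itself, just the mechanics:

```python
import re

def split_words(text, regexp=r"\w+"):
    # Every non-overlapping match of the pattern becomes one token.
    return re.findall(regexp, text)

print(split_words("Don't stop-believing!"))
print(split_words("Don't stop-believing!", regexp=r"[A-Za-z']+"))
```

Note how the default pattern splits "Don't" into two tokens, while a pattern that allows apostrophes keeps it whole.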
stem_nltk(*list_args, list_of_cols=[], stemmer='porter', new_col_name='_stemmed')¶
Transforms text to its word stem, base or root form. For example:
- dogs –> dog
- churches –> church
- abaci –> abacus
Parameters: - list_args (str(s), optional) – Specific columns to apply this technique to.
- list_of_cols (list, optional) – A list of specific columns to apply this technique to, by default []
- stemmer (str, optional) –
Type of NLTK stemmer to use, by default porter
- Current stemming implementations:
- porter
- snowball
For more information please refer to the NLTK stemming api https://www.nltk.org/api/nltk.stem.html
- new_col_name (str, optional) – New column name to be created when applying this technique, by default COLUMN_stemmed
Returns: Returns a deep copy of the Data object.
Return type: Data
Examples
>>> data.stem_nltk('col1')
>>> data.stem_nltk(['col1', 'col2'], stemmer='snowball')
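To illustrate what stemming does, here is a toy suffix-stripping stemmer. The real method delegates to NLTK's Porter or Snowball stemmers, which apply far more sophisticated rules:

```python
def naive_stem(word):
    # Strip a few common suffixes, keeping a minimum stem length so short
    # words are left untouched. This only approximates Porter-style rules.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["dogs", "churches", "barking"]])
```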