Text Mining¶
Overview¶
For each step, read the explanation, then run the code cell(s) right below it.
You will practice how to:
- Perform core text preprocessing steps including punctuation removal, stop word filtering, and stemming
- Compare different stop word sources (NLTK vs. sklearn) and understand how they impact results
- Apply stemming techniques and review their limitations
- Create a Bag-Of-Words count matrix using CountVectorizer and a TF-IDF matrix using TfidfTransformer
- Build a custom preprocessing pipeline as a tokenizer class for text preprocessing
- Apply dimensionality reduction (Truncated SVD) to extract meaningful latent features
- Train a model for text classification and evaluate predictions
Import libraries¶
import sys
import os
from pathlib import Path
import zipfile
import re
import pandas as pd
import numpy as np
from zipfile import ZipFile
import nltk
from nltk.stem.snowball import SnowballStemmer, EnglishStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_extraction._stop_words import ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import LogisticRegression
SEED = 42
#nltk.download('punkt_tab')
#nltk.download('stopwords')
# Import local libraries
root_dir = Path.cwd().resolve().parents[0]
sys.path.append(str(root_dir))
# Visualization functions
from src.visuals.make_plots import *
# Helper functions
from src.utils.helpers import *
# Load the "autoreload" extension so that changed code is re-imported automatically
%load_ext autoreload
#%reload_ext autoreload
# Always reload modules so that as you change code in src, it gets loaded
%autoreload 2
Text Preprocessing Examples¶
In the following cells, we will demonstrate how to do some simple manual text preprocessing using string operations and the nltk library:
- Remove punctuation
- Remove stop words
- Stem words
Remove punctuation from sample text
s = 'The technician was resolving technical issues quickly.'
words = [word for word in word_tokenize(s) if word not in punctuation]
words
['The', 'technician', 'was', 'resolving', 'technical', 'issues', 'quickly']
In this simple example, we removed all punctuation from the text. However, removing punctuation can be a problem in real-world text mining applications. Punctuation can carry important meaning and context that may improve model performance:
Emphasis & sentiment
- "This is amazing!!!" vs "This is amazing" - the repeated !!! signals stronger emotion or excitement
Tone & intent
- "Really?" vs "Really" - the question mark can indicate skepticism or confusion
Sarcasm or urgency
- "Oh great..." vs "Oh great" - ellipses (...) can imply sarcasm or hesitation
- "Stop!!!" may indicate urgency or intensity
Stylistic patterns
- Excessive punctuation (e.g., !!!, ???, ?!) can be strong signals in tasks like spam detection, sentiment analysis, or social media classification
✅ When it does make sense to remove punctuation
- When punctuation does not add meaningful signal for your task
- When you want to reduce noise and simplify the feature space
- For models that rely on word frequency alone (e.g., Bag-of-Words without additional feature engineering)
💡 Practical takeaway
Instead of blindly removing punctuation, consider:
- Keeping punctuation as-is
- Converting punctuation into engineered features (e.g., count of !, presence of ?)
- Using tokenizers that preserve punctuation when it is meaningful
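For example, the ideas above can be turned into a few simple engineered features. The snippet below is a minimal sketch (not part of the original lesson); the feature names are illustrative choices.
# Minimal sketch: converting punctuation into simple engineered features
# (illustrative feature names; adapt to your task)
examples = ["This is amazing!!!", "This is amazing", "Really?", "Oh great..."]
punct_features = pd.DataFrame({
    'text': examples,
    'n_exclaim': [s.count('!') for s in examples],                            # count of !
    'has_question': [int('?' in s) for s in examples],                        # presence of ?
    'has_ellipsis': [int(bool(re.search(r'\.{3,}', s))) for s in examples],   # ... present
})
punct_features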
Stop word sources
Let's review some sample output from two of the most common stop word sources:
NLTK stopwords
- Comes from the nltk.corpus.stopwords list
- One of the oldest and most widely used stop word lists
- Designed for general-purpose NLP tasks
scikit-learn stopwords
- Available via sklearn.feature_extraction.text.ENGLISH_STOP_WORDS
- Built specifically for machine learning workflows (e.g., TF-IDF, vectorization)
- More tightly integrated into sklearn pipelines
def preview_stop_words(stop_words, ncolumns, nrows):
    print(f'First {ncolumns * nrows} of {len(stop_words)} stopwords')
    for i in range(0, len(stop_words[:(ncolumns * nrows)]), ncolumns):
        print(''.join(word.ljust(13) for word in stop_words[i:(i+ncolumns)]))
NLTK stopwords usage
nltk_stop_words = sorted(stopwords.words('english'))
preview_stop_words(nltk_stop_words, 6, 30)
First 180 of 198 stopwords a about above after again against ain all am an and any are aren aren't as at be because been before being below between both but by can couldn couldn't d did didn didn't do does doesn doesn't doing don don't down during each few for from further had hadn hadn't has hasn hasn't have haven haven't having he he'd he'll he's her here hers herself him himself his how i i'd i'll i'm i've if in into is isn isn't it it'd it'll it's its itself just ll m ma me mightn mightn't more most mustn mustn't my myself needn needn't no nor not now o of off on once only or other our ours ourselves out over own re s same shan shan't she she'd she'll she's should should've shouldn shouldn't so some such t than that that'll the their theirs them themselves then there these they they'd they'll they're they've this those through to too under until up ve very was wasn wasn't we we'd we'll we're we've were weren weren't what when where which while who
sklearn.feature_extraction._stop_words usage
sk_stop_words = sorted(ENGLISH_STOP_WORDS)
preview_stop_words(sk_stop_words, 6, 30)
First 180 of 318 stopwords a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can cannot cant co con could couldnt cry de describe detail do done down due during each eg eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen fifty fill find fire first five for former formerly forty found four from front full further get give go had has hasnt have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie if in inc indeed interest into is it its itself keep last latter latterly least less ltd made many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely neither never nevertheless next nine no nobody none noone nor not
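Before committing to one source, it can help to quantify how the two lists differ. The cell below is a small illustrative comparison using the two lists we just loaded.
# Compare the two stop word sources (illustrative check)
nltk_set, sk_set = set(nltk_stop_words), set(sk_stop_words)
print('In both lists:', len(nltk_set & sk_set))
print('Only in NLTK (sample):', sorted(nltk_set - sk_set)[:10])
print('Only in sklearn (sample):', sorted(sk_set - nltk_set)[:10])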
Remove stop words from sample text
words = [word.lower() for word in words if word.lower() not in nltk_stop_words]
words
['technician', 'resolving', 'technical', 'issues', 'quickly']
In this example, we removed common stop words such as the and was. While this is a standard preprocessing step, it is not always the right choice for every machine learning task.
Stop words can carry important meaning, especially when they affect the structure or intent of a sentence:
Negation matters (critical for sentiment)
"This is not good"→ removing stop words →"good"- This completely flips the meaning of the sentence
Subtle differences in intent
"I do like this"vs"I like this"- Words like do can add emphasis or nuance
Phrase-level meaning
"to be or not to be"→ removing stop words destroys the structure and meaning- Many common phrases rely heavily on stop words
Context in short text
- In tweets, reviews, or short messages, stop words may make up a large portion of the signal
- Removing them can leave too little information
✅ When it does make sense to remove stop words
- When working with large documents where stop words add little distinguishing value
- In topic modeling or document classification where content words carry most of the signal
- When using simple models like Bag-of-Words or TF-IDF to reduce dimensionality
💡 Practical takeaway
Instead of automatically removing stop words, consider:
- Keeping all words and letting the model learn what matters
- Using custom stop word lists (e.g., keep words like not, no, never)
- Evaluating model performance with and without stop word removal
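As a concrete illustration of the custom-list idea above, the sketch below builds a list from the NLTK stop words while keeping a few negation words (the chosen words are assumptions; tune them for your task).
# Sketch: custom stop word list that keeps negation words
negations = {'not', 'no', 'nor'}   # assumed words to keep; adjust for your task
custom_stop_words = [w for w in nltk_stop_words if w not in negations]
sample = "This is not good"
print([w.lower() for w in word_tokenize(sample) if w.lower() not in custom_stop_words])
# the negation survives, so the sentiment signal is preserved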
Stem words from sample text
stemmer = SnowballStemmer("english")
word_stems = [stemmer.stem(word) for word in words]
word_stems
['technician', 'resolv', 'technic', 'issu', 'quick']
In this example, we stemmed the words using the English language SnowballStemmer. This applied stemming to reduce words to their root form (e.g., resolving → resolv, technical → technic, issues → issu). While this helps consolidate similar words, it also introduces some important trade-offs.
Stemming algorithms are rule-based and not tied to a standard English dictionary, which can lead to unexpected or inconsistent outputs:
Not real words
- "issues" → "issu"
- "better" → "better" (unchanged depending on algorithm)
- Stems are often not valid English words, which can reduce interpretability
Different libraries, different results
- PorterStemmer vs. SnowballStemmer vs. EnglishStemmer may produce different stems
- There is no single "correct" stem, just different algorithmic approximations
Over-simplification of meaning
- "organization" and "organ" may stem to similar roots
- Words with different meanings can collapse into the same stem
Loss of nuance
- "running" vs "runner" → both may reduce to "run"
- Important grammatical or contextual differences are removed
✅ When stemming can be useful
- When reducing vocabulary size is important (e.g., Bag-of-Words, TF-IDF)
- For large-scale text classification where exact word form is less important
- When speed and simplicity matter more than linguistic accuracy
💡 Practical takeaway
Before applying stemming, consider:
- Testing multiple stemming algorithms (they will differ)
- Comparing results across libraries
- Using lemmatization instead when you want dictionary-based, more interpretable word forms
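The sketch below compares two stemmers with a lemmatizer on a few of the words discussed above. It assumes the NLTK WordNet corpus is available (nltk.download('wordnet')); it is illustrative rather than part of the original workflow.
# Sketch: comparing stemmers and a lemmatizer (requires nltk.download('wordnet'))
from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
for w in ['resolving', 'issues', 'organization', 'running', 'runner']:
    print(f"{w:>14}  porter={porter.stem(w):<12} snowball={snowball.stem(w):<12} "
          f"lemma={lemmatizer.lemmatize(w, pos='v')}")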
Manual preprocessing function
def prepare_text(s):
    words = [word for word in word_tokenize(s)]
    words = [word for word in words if word not in punctuation]
    words = [word.lower() for word in words if word.lower() not in nltk_stop_words]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)
sentences = ['The technician was resolving technical issues quickly.',
'The engineer resolved several technical problems efficiently.']
sentences_clean = [prepare_text(s) for s in sentences]
sentences_clean
['technician resolv technic issu quick', 'engin resolv sever technic problem effici']
Create Bag-Of-Words count matrix
count_vect = CountVectorizer()
counts = count_vect.fit_transform(sentences_clean)
df = pd.DataFrame(counts.toarray(), columns=count_vect.get_feature_names_out())
df
| effici | engin | issu | problem | quick | resolv | sever | technic | technician | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
Key CountVectorizer Parameters¶
Below are some of the most important parameters for CountVectorizer that control how raw text is converted into a document-term matrix.
analyzer
Determines how text is processed into tokens.
"word"→ standard word-level tokenization (default)"char"→ character-level tokens"char_wb"→ character n-grams within word boundaries- custom function → full control over preprocessing, tokenization, and n-grams
token_pattern
Regular expression used to define what counts as a token (only used when analyzer='word' and no custom tokenizer is passed).
- Default: (?u)\b\w\w+\b → words with 2+ characters
- Can be customized to:
  - include punctuation
  - capture special patterns (e.g., ..., !!!)
Important: Ignored if you pass a custom tokenizer or analyzer
tokenizer
Custom function that splits text into tokens.
- Overrides token_pattern
- Gives control over:
  - punctuation handling
  - custom splitting logic
lowercase
Controls whether text is converted to lowercase.
- True (default) → "Great" and "great" are treated the same
- False → case-sensitive tokenization
stop_words
Removes common words from the vocabulary.
"english"→ built-in stop word list- list → custom stop words
None→ keep all words
ngram_range
Controls the range of n-grams to include.
- (1,1) → unigrams (default)
- (1,2) → unigrams + bigrams
- (2,2) → bigrams only
Useful for capturing phrases like "not good"
max_features
Limits the number of features (terms) to keep.
- Keeps the top N most frequent terms
min_df / max_df
Filters terms based on document frequency.
- min_df → ignore rare terms
  - e.g., min_df=2 → term must appear in at least 2 documents
- max_df → ignore overly common terms
  - e.g., max_df=0.95 → remove terms in 95%+ of documents
binary
Controls whether counts are binary.
- False (default) → term frequency counts
- True → 0/1 (term present or not)
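As a quick illustration of how some of these parameters interact, the cell below (not part of the original example) re-vectorizes the two raw sample sentences with the built-in stop word list, bigrams, and binary counts.
# Illustrative only: a few CountVectorizer parameters in combination
demo_vect = CountVectorizer(
    lowercase=True,           # default; "The" and "the" become the same token
    stop_words='english',     # built-in sklearn stop word list
    ngram_range=(1, 2),       # unigrams + bigrams, e.g. "technical issues"
    binary=True,              # 0/1 presence instead of raw counts
)
demo_counts = demo_vect.fit_transform(sentences)
print(demo_vect.get_feature_names_out())
print(demo_counts.toarray())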
Create TF-IDF matrix
tfidfTransformer = TfidfTransformer(smooth_idf=False, norm=None)
tfidf = tfidfTransformer.fit_transform(counts)
df_tfidf = pd.DataFrame(tfidf.toarray(), columns=count_vect.get_feature_names_out())
df_tfidf
| effici | engin | issu | problem | quick | resolv | sever | technic | technician | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 1.693147 | 0.000000 | 1.693147 | 1.0 | 0.000000 | 1.0 | 1.693147 |
| 1 | 1.693147 | 1.693147 | 0.000000 | 1.693147 | 0.000000 | 1.0 | 1.693147 | 1.0 | 0.000000 |
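These values can be reproduced by hand. With smooth_idf=False and norm=None, each entry is simply term frequency times idf, where idf(t) = ln(n / df(t)) + 1. The quick check below is illustrative.
# Reproduce the idf weights by hand: idf(t) = ln(n / df(t)) + 1 when smooth_idf=False
n_docs = counts.shape[0]                      # 2 documents
df_t = (counts.toarray() > 0).sum(axis=0)     # document frequency of each term
idf = np.log(n_docs / df_t) + 1
print(dict(zip(count_vect.get_feature_names_out(), np.round(idf, 6))))
# terms appearing in one document get ln(2) + 1 ≈ 1.693147; terms in both get 1.0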
Key TfidfTransformer Parameters¶
Below are some of the most important parameters for TfidfTransformer, which converts a document-term matrix (counts) into TF-IDF features.
use_idf
Controls whether inverse document frequency (IDF) is used.
- True (default) → applies TF-IDF weighting
- False → uses only term frequency (TF)
smooth_idf
Applies smoothing to the IDF calculation.
- True (default) → prevents division by zero by adding 1 to document counts
- False → uses the raw IDF formula
sublinear_tf
Applies logarithmic scaling to term frequency.
- False (default) → uses raw term counts
- True → uses 1 + log(TF)
norm
Controls normalization of TF-IDF vectors.
"l2"(default) → scales vectors so sum of squares = 1"l1"→ scales vectors so sum of absolute values = 1None→ no normalization
Create a StemmingTokenizer class
Instead of using a function to manually preprocess the text, let's create a class that can be passed as a custom stemming tokenizer to CountVectorizer(). We will use EnglishStemmer() and the sklearn ENGLISH_STOP_WORDS.
In this example, we will also add one additional sentence for example purposes:
"The engineer is not resolving the problem very quickly!"
We will also keep the stop word very and include the ! punctuation as a token.
class StemmingTokenizer:
    def __init__(self, stop_words, stop_words_keep=None, token_pattern=None):
        # Instantiate EnglishStemmer()
        self.stemmer = EnglishStemmer()
        # Create stop_words_keep from input or empty list
        stop_words_keep = stop_words_keep or []
        self.stopWords = set([word for word in stop_words if word not in stop_words_keep])
        # Store the optional token pattern
        self.token_pattern = token_pattern

    def __call__(self, doc):
        # Use NLTK word_tokenize so punctuation is separated if applicable
        tokens = word_tokenize(doc.lower())
        # Optionally keep only tokens that match the regex pattern
        if self.token_pattern is not None:
            tokens = [
                t for t in tokens
                if re.fullmatch(self.token_pattern, t)
            ]
        # Remove stop words and stem the remaining tokens
        return [
            self.stemmer.stem(t)
            for t in tokens
            if t not in self.stopWords
        ]
sentences_new = sentences + ["The engineer is not resolving the problem very quickly!"]
stop_words_keep = ['very']
count_vect2 = CountVectorizer(tokenizer=StemmingTokenizer(ENGLISH_STOP_WORDS,
                                                          stop_words_keep=stop_words_keep,
                                                          token_pattern=r"(?u)\b\w+\b|\.{3,}|[!]+"),
                              token_pattern=None)
counts2 = count_vect2.fit_transform(sentences_new)
df2 = pd.DataFrame(counts2.toarray(), columns=count_vect2.get_feature_names_out())
df2
| ! | effici | engin | issu | problem | quick | resolv | technic | technician | veri | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
Looking at the new output, we added the new token ! and veri (the stem of the retained stop word very) as designed. However, we lost the sever token for the word several. Let's go back and check for the word several in both stop word sources.
word = 'several'
print(f"Check if {word} is present in NLTK stop_words: {word in nltk_stop_words}")
print(f"Check if {word} is present in sklearn ENGLISH_STOP_WORDS: {word in sk_stop_words}")
Check if several is present in NLTK stop_words: False
Check if several is present in sklearn ENGLISH_STOP_WORDS: True
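Since the tokenizer already accepts stop_words_keep, one way to retain that signal (an illustrative follow-up, not in the original notebook) is to keep several as well:
# Illustrative: also keep 'several' so its stem reappears in the vocabulary
count_vect3 = CountVectorizer(tokenizer=StemmingTokenizer(ENGLISH_STOP_WORDS,
                                                          stop_words_keep=['very', 'several'],
                                                          token_pattern=r"(?u)\b\w+\b|\.{3,}|[!]+"),
                              token_pattern=None)
counts3 = count_vect3.fit_transform(sentences_new)
print(count_vect3.get_feature_names_out())   # 'sever' should now be back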
Online Discussions on Autos and Electronics Example¶
In this example, we will be working on a classification task to classify Internet discussion posts as either auto‐related or electronics‐related. One post looks like this:
From: smith@logos.asd.sgi.com (Tom Smith) Subject: Ford Explorer 4WD ‐ do I need performance axle?
We're considering getting a Ford Explorer XLT with 4WD and we have the following questions (All we would do is go skiing ‐ no off‐roading):
1. With 4WD, do we need the “performance axle” ‐ (limited slip axle). Its purpose is to allow the tires to act independently when the tires are on different terrain.
2. Do we need the all‐terrain tires (P235/75X15) or will the all‐season (P225/70X15) be good enough for us at Lake Tahoe?
Thanks,
Tom
–
================================================
Tom Smith Silicon Graphics smith@asd.sgi.com 2011 N. Shoreline Rd. MS 8U‐815 415‐962‐0494 (fax) Mountain View, CA 94043
================================================
The posts are taken from Internet groups devoted to autos and electronics, so are pre‐labeled. This one, clearly, is auto‐related. A related organizational scenario might involve messages received by a medical office that must be classified as medical or nonmedical (the messages in such a real scenario would probably have to be labeled by humans as part of the preprocessing).
Load data into corpus and labels
auto_electronics_zip = os.path.join('..', 'data', 'AutoAndElectronics.zip')
corpus = []
labels = []
with zipfile.ZipFile(auto_electronics_zip) as rawData:
    for info in rawData.infolist():
        if info.is_dir():
            continue
        labels.append(1 if 'rec.autos' in info.filename else 0)
        corpus.append(rawData.read(info).decode('latin1'))
print('Corpus size:', len(corpus))
print('Labels size:', len(labels))
Corpus size: 2000
Labels size: 2000
Split the data into training and test sets
When you fit TfidfTransformer, it learns:
- IDF values (inverse document frequency)
- Which words are “rare” vs “common” across the dataset
If you fit on the entire corpus, those IDF values are influenced by both the training and test data, which is an example of data leakage.
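A minimal sketch of the leakage-safe pattern (with made-up documents and variable names): fit the vectorizer and transformer on training text only, then only call transform on the test text. The actual split for our corpus happens in the next cell, and the Pipeline built later enforces this pattern automatically.
# Sketch: fit on training text only, transform test text (illustrative documents)
train_docs = ["the engine stalls on cold mornings", "replaced the brake pads"]
test_docs = ["amplifier hum on the left channel"]
cv = CountVectorizer()
tf = TfidfTransformer()
train_tfidf = tf.fit_transform(cv.fit_transform(train_docs))   # learn vocabulary + IDF
test_tfidf = tf.transform(cv.transform(test_docs))             # reuse them, no refitting
print(train_tfidf.shape, test_tfidf.shape)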
X_train, X_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.30, random_state=SEED)
print('Training set:', len(X_train))
print('Test set:', len(X_test))
Training set: 1400
Test set: 600
Create a new StemmingTokenizer
For this example, we are going to make one customization to the StemmingTokenizer and only keep tokens that are made up entirely of letters using .isalpha().
class StemmingTokenizer:
    def __init__(self, stop_words, stop_words_keep=None, token_pattern=None):
        # Instantiate EnglishStemmer()
        self.stemmer = EnglishStemmer()
        # Create stop_words_keep from input or empty list
        stop_words_keep = stop_words_keep or []
        self.stopWords = set([word for word in stop_words if word not in stop_words_keep])
        # Store the optional token pattern
        self.token_pattern = token_pattern

    def __call__(self, doc):
        # Use NLTK word_tokenize so punctuation is separated if applicable
        tokens = word_tokenize(doc.lower())
        # Optionally keep only tokens that match the regex pattern
        if self.token_pattern is not None:
            tokens = [
                t for t in tokens
                if re.fullmatch(self.token_pattern, t)
            ]
        # Keep only alphabetic tokens, then remove stop words and stem the rest
        return [
            self.stemmer.stem(t)
            for t in tokens
            if t.isalpha() and t not in self.stopWords
        ]
count_vect_train = CountVectorizer(tokenizer=StemmingTokenizer(ENGLISH_STOP_WORDS), token_pattern=None)
counts_train = count_vect_train.fit_transform(X_train)
tfidfTransformer = TfidfTransformer()
tfidf_train = tfidfTransformer.fit_transform(counts_train)
tfidf_train.shape
(1400, 11761)
After transforming our text training data into a tfidf matrix, we now have:
- 1400 observations (documents)
- 11,761 features (unique tokens)
This is a good example of a very high-dimensional dataset:
- The number of features is very large relative to the number of observations
- This can create a sparse and noisy feature space
- Many words appear infrequently, adding little predictive value
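A quick illustrative check of just how sparse this matrix is:
# How sparse is the TF-IDF matrix? (illustrative check)
n_rows, n_cols = tfidf_train.shape
fill = tfidf_train.nnz / (n_rows * n_cols)
print(f"Non-zero entries: {tfidf_train.nnz:,} of {n_rows * n_cols:,} ({fill:.2%} filled)")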
Reduce the dimensionality with Singular Value Decomposition
To address this, we will reduce the dimensionality of the data while preserving as much important information as possible. For manageability, we will limit the number of concepts to 20.
lsa = make_pipeline(TruncatedSVD(20, random_state=SEED),
                    Normalizer(copy=False))
lsa_tfidf = lsa.fit_transform(tfidf_train)
lsa_tfidf.shape
(1400, 20)
After applying Truncated SVD, our dataset has been transformed from a very high-dimensional space into a much more compact representation:
- 1,400 observations
- 20 latent components (new features)
The original 11,000+ terms have been reduced to just 20 latent components. Each component represents a combination of words capturing underlying patterns in the text. Instead of working with individual words, we are now working with concept-level features.
✅ Why this is powerful
- Reduces noise → removes many low-frequency or uninformative terms
- Improves efficiency → faster model training and lower memory usage
- Helps generalization → reduces overfitting by simplifying the feature space
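To get a feel for what these latent components contain, the sketch below (illustrative, not part of the original flow) pulls the fitted TruncatedSVD step out of the pipeline, reports how much variance the 20 components retain, and lists the stems that load most heavily on the first component.
# Illustrative inspection of the fitted SVD step
svd = lsa.named_steps['truncatedsvd']
print(f"Variance retained by 20 components: {svd.explained_variance_ratio_.sum():.2%}")

terms = count_vect_train.get_feature_names_out()
top_idx = np.argsort(np.abs(svd.components_[0]))[::-1][:10]
print("Top stems in component 0:", [terms[i] for i in top_idx])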
Train model and make predictions
# Full modeling pipeline
text_model = Pipeline([
    ('count_vectorizer', CountVectorizer(tokenizer=StemmingTokenizer(ENGLISH_STOP_WORDS), token_pattern=None)),
    ('tfidf', TfidfTransformer()),
    ('lsa', make_pipeline(
        TruncatedSVD(n_components=20, random_state=SEED),
        Normalizer(copy=False)
    )),
    ('log_reg', LogisticRegression(solver='lbfgs', max_iter=1000, random_state=SEED))
])
# Fit only on training data
text_model.fit(X_train, y_train)
text_model
Pipeline(steps=[('count_vectorizer',
CountVectorizer(token_pattern=None,
tokenizer=<__main__.StemmingTokenizer object at 0x00000215325C4890>)),
('tfidf', TfidfTransformer()),
('lsa',
Pipeline(steps=[('truncatedsvd',
TruncatedSVD(n_components=20,
random_state=42)),
('normalizer', Normalizer(copy=False))])),
('log_reg',
LogisticRegression(max_iter=1000, random_state=42))])
text_pred = text_model.predict(X_test)
title1 = 'Logistic Regression Text Model'
text_metrics = evaluate_model(y_test, text_pred, beta=2, model_name=title1)
text_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Logistic Regression Text Model | 0.958333 | 0.975945 | 0.940397 | 0.957841 | 0.947298 |
fig, ax = plt.subplots(figsize=(4, 4))
plot_confusion_matrix(text_model, X_test, y_test, ax, title1)
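As a final sanity check, the fitted pipeline can score raw, unseen text directly. The two posts below are made up for illustration (1 = auto-related, 0 = electronics-related).
# Illustrative: classify two made-up posts with the fitted pipeline
new_posts = [
    "My Explorer needs new brake pads and the transmission is slipping.",
    "Looking for a cheap op-amp to build a headphone amplifier circuit.",
]
print(text_model.predict(new_posts))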