Text Mining¶
Overview¶
For each step, read the explanation, then run the code cell(s) right below it.
You will practice how to:
- Perform core text preprocessing steps including punctuation removal, stop word filtering, and stemming
- Compare different stop word sources (NLTK vs. sklearn) and understand how they impact results
- Apply stemming techniques and review their limitations
- Create a Bag-Of-Words count matrix using CountVectorizer and a TF-IDF matrix using TfidfTransformer
- Build a custom preprocessing pipeline as a tokenizer class for text preprocessing
- Apply dimensionality reduction (Truncated SVD) to extract meaningful latent features
- Train a model for text classification and evaluate predictions
Import libraries¶
import sys
import os
from pathlib import Path
import zipfile
import re
import pandas as pd
import numpy as np
from zipfile import ZipFile
import nltk
from nltk.stem.snowball import SnowballStemmer, EnglishStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_extraction._stop_words import ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import LogisticRegression
SEED = 42
#nltk.download('punkt_tab')
#nltk.download('stopwords')
# Import local libraries
root_dir = Path.cwd().resolve().parents[0]
sys.path.append(str(root_dir))
# Visualization functions
from src.visuals.make_plots import *
# Helper functions
from src.utils.helpers import *
# Load the "autoreload" extension so that changed code is re-imported automatically
%load_ext autoreload
#%reload_ext autoreload
# Always reload modules so that as you change code in src, it gets loaded
%autoreload 2
Text Preprocessing Examples¶
In the following cells, we will demonstrate how to do some simple manual text preprocessing using string operations and the nltk library:
- Remove punctuation
- Remove stop words
- Stem words
Remove punctuation from sample text
s = 'The technician was resolving technical issues quickly.'
words = [word for word in word_tokenize(s) if word not in punctuation]
words
['The', 'technician', 'was', 'resolving', 'technical', 'issues', 'quickly']
In this simple example, we removed all punctuation from the text. However, removing punctuation can be a problem in real-world text mining applications. Punctuation can carry important meaning and context that may improve model performance:
Emphasis & sentiment
- "This is amazing!!!" vs "This is amazing" - the repeated !!! signals stronger emotion or excitement
Tone & intent
- "Really?" vs "Really" - the question mark can indicate skepticism or confusion
Sarcasm or urgency
- "Oh great..." vs "Oh great" - ellipses (...) can imply sarcasm or hesitation
- "Stop!!!" may indicate urgency or intensity
Stylistic patterns
- Excessive punctuation (e.g., !!!, ???, ?!) can be strong signals in tasks like spam detection, sentiment analysis, or social media classification
✅ When it does make sense to remove punctuation
- When punctuation does not add meaningful signal for your task
- When you want to reduce noise and simplify the feature space
- For models that rely on word frequency alone (e.g., Bag-of-Words without additional feature engineering)
💡 Practical takeaway
Instead of blindly removing punctuation, consider:
- Keeping punctuation as-is
- Converting punctuation into engineered features (e.g., count of !, presence of ?)
- Using tokenizers that preserve punctuation when it is meaningful
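For example, the ideas above can be turned into a few simple engineered features. The snippet below is a minimal sketch (not part of the original lesson); the feature names are illustrative choices.
# Minimal sketch: converting punctuation into simple engineered features
# (illustrative feature names; adapt to your task)
examples = ["This is amazing!!!", "This is amazing", "Really?", "Oh great..."]
punct_features = pd.DataFrame({
    'text': examples,
    'n_exclaim': [s.count('!') for s in examples],                            # count of !
    'has_question': [int('?' in s) for s in examples],                        # presence of ?
    'has_ellipsis': [int(bool(re.search(r'\.{3,}', s))) for s in examples],   # ... present
})
punct_features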
Stop word sources
Let's review some sample output from two of the most common stop word sources:
NLTK stopwords
- Comes from the nltk.corpus.stopwords list
- One of the oldest and most widely used stop word lists
- Designed for general-purpose NLP tasks
scikit-learn stopwords
- Available via sklearn.feature_extraction.text.ENGLISH_STOP_WORDS
- Built specifically for machine learning workflows (e.g., TF-IDF, vectorization)
- More tightly integrated into sklearn pipelines
def preview_stop_words(stop_words, ncolumns, nrows):
    print(f'First {ncolumns * nrows} of {len(stop_words)} stopwords')
    for i in range(0, len(stop_words[:(ncolumns * nrows)]), ncolumns):
        print(''.join(word.ljust(13) for word in stop_words[i:(i+ncolumns)]))
NLTK stopwords usage
nltk_stop_words = sorted(stopwords.words('english'))
preview_stop_words(nltk_stop_words, 6, 30)
First 180 of 198 stopwords a about above after again against ain all am an and any are aren aren't as at be because been before being below between both but by can couldn couldn't d did didn didn't do does doesn doesn't doing don don't down during each few for from further had hadn hadn't has hasn hasn't have haven haven't having he he'd he'll he's her here hers herself him himself his how i i'd i'll i'm i've if in into is isn isn't it it'd it'll it's its itself just ll m ma me mightn mightn't more most mustn mustn't my myself needn needn't no nor not now o of off on once only or other our ours ourselves out over own re s same shan shan't she she'd she'll she's should should've shouldn shouldn't so some such t than that that'll the their theirs them themselves then there these they they'd they'll they're they've this those through to too under until up ve very was wasn wasn't we we'd we'll we're we've were weren weren't what when where which while who
sklearn.feature_extraction._stop_words usage
sk_stop_words = sorted(ENGLISH_STOP_WORDS)
preview_stop_words(sk_stop_words, 6, 30)
First 180 of 318 stopwords a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can cannot cant co con could couldnt cry de describe detail do done down due during each eg eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen fifty fill find fire first five for former formerly forty found four from front full further get give go had has hasnt have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie if in inc indeed interest into is it its itself keep last latter latterly least less ltd made many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely neither never nevertheless next nine no nobody none noone nor not
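Before committing to one source, it can help to quantify how the two lists differ. The cell below is a small illustrative comparison using the two lists we just loaded.
# Compare the two stop word sources (illustrative check)
nltk_set, sk_set = set(nltk_stop_words), set(sk_stop_words)
print('In both lists:', len(nltk_set & sk_set))
print('Only in NLTK (sample):', sorted(nltk_set - sk_set)[:10])
print('Only in sklearn (sample):', sorted(sk_set - nltk_set)[:10])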
Remove stop words from sample text
words = [word.lower() for word in words if word.lower() not in nltk_stop_words]
words
['technician', 'resolving', 'technical', 'issues', 'quickly']
In this example, we removed common stop words such as the and was. While this is a standard preprocessing step, it is not always the right choice for every machine learning task.
Stop words can carry important meaning, especially when they affect the structure or intent of a sentence:
Negation matters (critical for sentiment)
"This is not good"→ removing stop words →"good"- This completely flips the meaning of the sentence
Subtle differences in intent
"I do like this"vs"I like this"- Words like do can add emphasis or nuance
Phrase-level meaning
"to be or not to be"→ removing stop words destroys the structure and meaning- Many common phrases rely heavily on stop words
Context in short text
- In tweets, reviews, or short messages, stop words may make up a large portion of the signal
- Removing them can leave too little information
✅ When it does make sense to remove stop words
- When working with large documents where stop words add little distinguishing value
- In topic modeling or document classification where content words carry most of the signal
- When using simple models like Bag-of-Words or TF-IDF to reduce dimensionality
💡 Practical takeaway
Instead of automatically removing stop words, consider:
- Keeping all words and letting the model learn what matters
- Using custom stop word lists (e.g., keep words like not, no, never)
- Evaluating model performance with and without stop word removal
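As a concrete illustration of the custom-list idea above, the sketch below builds a list from the NLTK stop words while keeping a few negation words (the chosen words are assumptions; tune them for your task).
# Sketch: custom stop word list that keeps negation words
negations = {'not', 'no', 'nor'}   # assumed words to keep; adjust for your task
custom_stop_words = [w for w in nltk_stop_words if w not in negations]
sample = "This is not good"
print([w.lower() for w in word_tokenize(sample) if w.lower() not in custom_stop_words])
# the negation survives, so the sentiment signal is preserved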
Stem words from sample text
stemmer = SnowballStemmer("english")
word_stems = [stemmer.stem(word) for word in words]
word_stems
['technician', 'resolv', 'technic', 'issu', 'quick']
In this example, we stemmed the words using the English language SnowballStemmer. This applied stemming to reduce words to their root form (e.g., resolving → resolv, technical → technic, issues → issu). While this helps consolidate similar words, it also introduces some important trade-offs.
Stemming algorithms are rule-based and not tied to a standard English dictionary, which can lead to unexpected or inconsistent outputs:
Not real words
- "issues" → "issu"
- "better" → "better" (unchanged depending on algorithm)
- Stems are often not valid English words, which can reduce interpretability
Different libraries, different results
- PorterStemmer vs. SnowballStemmer vs. EnglishStemmer may produce different stems
- There is no single "correct" stem, just different algorithmic approximations
Over-simplification of meaning
- "organization" and "organ" may stem to similar roots
- Words with different meanings can collapse into the same stem
Loss of nuance
- "running" vs "runner" → both may reduce to "run"
- Important grammatical or contextual differences are removed
✅ When stemming can be useful
- When reducing vocabulary size is important (e.g., Bag-of-Words, TF-IDF)
- For large-scale text classification where exact word form is less important
- When speed and simplicity matter more than linguistic accuracy
💡 Practical takeaway
Before applying stemming, consider:
- Testing multiple stemming algorithms (they will differ)
- Comparing results across libraries
- Using lemmatization instead when you want dictionary-based, more interpretable word forms
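The sketch below compares two stemmers with a lemmatizer on a few of the words discussed above. It assumes the NLTK WordNet corpus is available (nltk.download('wordnet')); it is illustrative rather than part of the original workflow.
# Sketch: comparing stemmers and a lemmatizer (requires nltk.download('wordnet'))
from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
for w in ['resolving', 'issues', 'organization', 'running', 'runner']:
    print(f"{w:>14}  porter={porter.stem(w):<12} snowball={snowball.stem(w):<12} "
          f"lemma={lemmatizer.lemmatize(w, pos='v')}")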
Manual preprocessing function
def prepare_text(s):
    words = [word for word in word_tokenize(s)]
    words = [word for word in words if word not in punctuation]
    words = [word.lower() for word in words if word.lower() not in nltk_stop_words]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)
sentences = ['The technician was resolving technical issues quickly.',
'The engineer resolved several technical problems efficiently.']
sentences_clean = [prepare_text(s) for s in sentences]
sentences_clean
['technician resolv technic issu quick', 'engin resolv sever technic problem effici']
Create Bag-Of-Words count matrix
count_vect = CountVectorizer()
counts = count_vect.fit_transform(sentences_clean)
df = pd.DataFrame(counts.toarray(), columns=count_vect.get_feature_names_out())
df
| effici | engin | issu | problem | quick | resolv | sever | technic | technician | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
Key CountVectorizer Parameters¶
Below are some of the most important parameters for CountVectorizer that control how raw text is converted into a document-term matrix.
analyzer
Determines how text is processed into tokens.
"word"→ standard word-level tokenization (default)"char"→ character-level tokens"char_wb"→ character n-grams within word boundaries- custom function → full control over preprocessing, tokenization, and n-grams
token_pattern
Regular expression used to define what counts as a token (only used when analyzer='word' and no custom tokenizer is passed).
- Default: (?u)\b\w\w+\b → words with 2+ characters
- Can be customized to:
  - include punctuation
  - capture special patterns (e.g., ..., !!!)
Important: Ignored if you pass a custom tokenizer or analyzer
tokenizer
Custom function that splits text into tokens.
- Overrides token_pattern
- Gives control over:
  - punctuation handling
  - custom splitting logic
lowercase
Controls whether text is converted to lowercase.
- True (default) → "Great" and "great" are treated the same
- False → case-sensitive tokenization
stop_words
Removes common words from the vocabulary.
"english"→ built-in stop word list- list → custom stop words
None→ keep all words
ngram_range
Controls the range of n-grams to include.
- (1,1) → unigrams (default)
- (1,2) → unigrams + bigrams
- (2,2) → bigrams only
Useful for capturing phrases like "not good"
max_features
Limits the number of features (terms) to keep.
- Keeps the top N most frequent terms
min_df / max_df
Filters terms based on document frequency.
- min_df → ignore rare terms
  - e.g., min_df=2 → term must appear in at least 2 documents
- max_df → ignore overly common terms
  - e.g., max_df=0.95 → remove terms in 95%+ of documents
binary
Controls whether counts are binary.
- False (default) → term frequency counts
- True → 0/1 (term present or not)
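As a quick illustration of how some of these parameters interact, the cell below (not part of the original example) re-vectorizes the two raw sample sentences with the built-in stop word list, bigrams, and binary counts.
# Illustrative only: a few CountVectorizer parameters in combination
demo_vect = CountVectorizer(
    lowercase=True,           # default; "The" and "the" become the same token
    stop_words='english',     # built-in sklearn stop word list
    ngram_range=(1, 2),       # unigrams + bigrams, e.g. "technical issues"
    binary=True,              # 0/1 presence instead of raw counts
)
demo_counts = demo_vect.fit_transform(sentences)
print(demo_vect.get_feature_names_out())
print(demo_counts.toarray())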
Create TF-IDF matrix
tfidfTransformer = TfidfTransformer(smooth_idf=False, norm=None)
tfidf = tfidfTransformer.fit_transform(counts)
df_tfidf = pd.DataFrame(tfidf.toarray(), columns=count_vect.get_feature_names_out())
df_tfidf
| effici | engin | issu | problem | quick | resolv | sever | technic | technician | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 1.693147 | 0.000000 | 1.693147 | 1.0 | 0.000000 | 1.0 | 1.693147 |
| 1 | 1.693147 | 1.693147 | 0.000000 | 1.693147 | 0.000000 | 1.0 | 1.693147 | 1.0 | 0.000000 |
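These values can be reproduced by hand. With smooth_idf=False and norm=None, each entry is simply term frequency times idf, where idf(t) = ln(n / df(t)) + 1. The quick check below is illustrative.
# Reproduce the idf weights by hand: idf(t) = ln(n / df(t)) + 1 when smooth_idf=False
n_docs = counts.shape[0]                      # 2 documents
df_t = (counts.toarray() > 0).sum(axis=0)     # document frequency of each term
idf = np.log(n_docs / df_t) + 1
print(dict(zip(count_vect.get_feature_names_out(), np.round(idf, 6))))
# terms appearing in one document get ln(2) + 1 ≈ 1.693147; terms in both get 1.0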
Key TfidfTransformer Parameters¶
Below are some of the most important parameters for TfidfTransformer, which converts a document-term matrix (counts) into TF-IDF features.
use_idf
Controls whether inverse document frequency (IDF) is used.
- True (default) → applies TF-IDF weighting
- False → uses only term frequency (TF)
smooth_idf
Applies smoothing to the IDF calculation.
- True (default) → prevents division by zero by adding 1 to document counts
- False → uses the raw IDF formula
sublinear_tf
Applies logarithmic scaling to term frequency.
- False (default) → uses raw term counts
- True → uses 1 + log(TF)
norm
Controls normalization of TF-IDF vectors.
"l2"(default) → scales vectors so sum of squares = 1"l1"→ scales vectors so sum of absolute values = 1None→ no normalization
Create a StemmingTokenizer class
Instead of using a function to manually preprocess the text, let's create a class that can be passed as a custom stemming tokenizer to CountVectorizer(). We will use EnglishStemmer() and the sklearn ENGLISH_STOP_WORDS.
In this example, we will also add one additional sentence for example purposes:
"The engineer is not resolving the problem very quickly!"
We will also keep the stop word very and include the ! punctuation as a token.
class StemmingTokenizer:
    def __init__(self, stop_words, stop_words_keep=None, token_pattern=None):
        # Instantiate EnglishStemmer()
        self.stemmer = EnglishStemmer()
        # Create stop_words_keep from input or empty list
        stop_words_keep = stop_words_keep or []
        self.stopWords = set([word for word in stop_words if word not in stop_words_keep])
        # Store the optional token pattern
        self.token_pattern = token_pattern

    def __call__(self, doc):
        # Use NLTK word_tokenize so punctuation is separated if applicable
        tokens = word_tokenize(doc.lower())
        # Optionally keep only tokens that match the regex pattern
        if self.token_pattern is not None:
            tokens = [
                t for t in tokens
                if re.fullmatch(self.token_pattern, t)
            ]
        # Remove stop words and stem the remaining tokens
        return [
            self.stemmer.stem(t)
            for t in tokens
            if t not in self.stopWords
        ]
sentences_new = sentences + ["The engineer is not resolving the problem very quickly!"]
stop_words_keep = ['very']
count_vect2 = CountVectorizer(tokenizer=StemmingTokenizer(ENGLISH_STOP_WORDS,
                                                          stop_words_keep=stop_words_keep,
                                                          token_pattern=r"(?u)\b\w+\b|\.{3,}|[!]+"),
                              token_pattern=None)
counts2 = count_vect2.fit_transform(sentences_new)
df2 = pd.DataFrame(counts2.toarray(), columns=count_vect2.get_feature_names_out())
df2
| ! | effici | engin | issu | problem | quick | resolv | technic | technician | veri | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
Looking at the new output, we added the new token ! and veri (the stem of the retained stop word very) as designed. However, we lost the sever token for the word several. Let's go back and check for the word several in both stop word sources.
word = 'several'
print(f"Check if {word} is present in NLTK stop_words: {word in nltk_stop_words}")
print(f"Check if {word} is present in sklearn ENGLISH_STOP_WORDS: {word in sk_stop_words}")
Check if several is present in NLTK stop_words: False
Check if several is present in sklearn ENGLISH_STOP_WORDS: True
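Since the tokenizer already accepts stop_words_keep, one way to retain that signal (an illustrative follow-up, not in the original notebook) is to keep several as well:
# Illustrative: also keep 'several' so its stem reappears in the vocabulary
count_vect3 = CountVectorizer(tokenizer=StemmingTokenizer(ENGLISH_STOP_WORDS,
                                                          stop_words_keep=['very', 'several'],
                                                          token_pattern=r"(?u)\b\w+\b|\.{3,}|[!]+"),
                              token_pattern=None)
counts3 = count_vect3.fit_transform(sentences_new)
print(count_vect3.get_feature_names_out())   # 'sever' should now be back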
Online Discussions on Autos and Electronics Example¶
In this example, we will be working on a classification task to classify Internet discussion posts as either auto‐related or electronics‐related. One post looks like this:
From: smith@logos.asd.sgi.com (Tom Smith) Subject: Ford Explorer 4WD ‐ do I need performance axle?
We're considering getting a Ford Explorer XLT with 4WD and we have the following questions (All we would do is go skiing ‐ no off‐roading):
1. With 4WD, do we need the “performance axle” ‐ (limited slip axle). Its purpose is to allow the tires to act independently when the tires are on different terrain.
2. Do we need the all‐terrain tires (P235/75X15) or will the all‐season (P225/70X15) be good enough for us at Lake Tahoe?
Thanks,
Tom
–
================================================
Tom Smith Silicon Graphics smith@asd.sgi.com 2011 N. Shoreline Rd. MS 8U‐815 415‐962‐0494 (fax) Mountain View, CA 94043
================================================
The posts are taken from Internet groups devoted to autos and electronics, so are pre‐labeled. This one, clearly, is auto‐related. A related organizational scenario might involve messages received by a medical office that must be classified as medical or nonmedical (the messages in such a real scenario would probably have to be labeled by humans as part of the preprocessing).
Load data into corpus and labels
auto_electronics_zip = os.path.join('..', 'data', 'AutoAndElectronics.zip')
corpus = []
labels = []
with zipfile.ZipFile(auto_electronics_zip) as rawData:
    for info in rawData.infolist():
        if info.is_dir():
            continue
        labels.append(1 if 'rec.autos' in info.filename else 0)
        corpus.append(rawData.read(info).decode('latin1'))
print('Corpus size:', len(corpus))
print('Labels size:', len(labels))
Corpus size: 2000
Labels size: 2000
Split the data into training and test sets
When you fit TfidfTransformer, it learns:
- IDF values (inverse document frequency)
- Which words are “rare” vs “common” across the dataset
If you fit on the entire corpus, those IDF values are influenced by both the training and test data, which is an example of data leakage.
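A minimal sketch of the leakage-safe pattern (with made-up documents and variable names): fit the vectorizer and transformer on training text only, then only call transform on the test text. The actual split for our corpus happens in the next cell, and the Pipeline built later enforces this pattern automatically.
# Sketch: fit on training text only, transform test text (illustrative documents)
train_docs = ["the engine stalls on cold mornings", "replaced the brake pads"]
test_docs = ["amplifier hum on the left channel"]
cv = CountVectorizer()
tf = TfidfTransformer()
train_tfidf = tf.fit_transform(cv.fit_transform(train_docs))   # learn vocabulary + IDF
test_tfidf = tf.transform(cv.transform(test_docs))             # reuse them, no refitting
print(train_tfidf.shape, test_tfidf.shape)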
X_train, X_test, y_train, y_test = train_test_split(corpus, labels, test_size=0.30, random_state=SEED)
print('Training set:', len(X_train))
print('Test set:', len(X_test))
Training set: 1400
Test set: 600
Create a new StemmingTokenizer
For this example, we are going to make one customization to the StemmingTokenizer and only keep tokens that are made up entirely of letters using .isalpha().
class StemmingTokenizer:
    def __init__(self, stop_words, stop_words_keep=None, token_pattern=None):
        # Instantiate EnglishStemmer()
        self.stemmer = EnglishStemmer()
        # Create stop_words_keep from input or empty list
        stop_words_keep = stop_words_keep or []
        self.stopWords = set([word for word in stop_words if word not in stop_words_keep])
        # Store the optional token pattern
        self.token_pattern = token_pattern

    def __call__(self, doc):
        # Use NLTK word_tokenize so punctuation is separated if applicable
        tokens = word_tokenize(doc.lower())
        # Optionally keep only tokens that match the regex pattern
        if self.token_pattern is not None:
            tokens = [
                t for t in tokens
                if re.fullmatch(self.token_pattern, t)
            ]
        # Keep only alphabetic tokens, then remove stop words and stem the rest
        return [
            self.stemmer.stem(t)
            for t in tokens
            if t.isalpha() and t not in self.stopWords
        ]
count_vect_train = CountVectorizer(tokenizer=StemmingTokenizer(ENGLISH_STOP_WORDS), token_pattern=None)
counts_train = count_vect_train.fit_transform(X_train)
tfidfTransformer = TfidfTransformer()
tfidf_train = tfidfTransformer.fit_transform(counts_train)
tfidf_train.shape
(1400, 11761)
After transforming our text training data into a tfidf matrix, we now have:
- 1400 observations (documents)
- 11,761 features (unique tokens)
This is a good example of a very high-dimensional dataset:
- The number of features is very large relative to the number of observations
- This can create a sparse and noisy feature space
- Many words appear infrequently, adding little predictive value
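A quick illustrative check of just how sparse this matrix is:
# How sparse is the TF-IDF matrix? (illustrative check)
n_rows, n_cols = tfidf_train.shape
fill = tfidf_train.nnz / (n_rows * n_cols)
print(f"Non-zero entries: {tfidf_train.nnz:,} of {n_rows * n_cols:,} ({fill:.2%} filled)")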
Reduce the dimensionality with Singular Value Decomposition
To address this, we will reduce the dimensionality of the data while preserving as much important information as possible. For manageability, we will limit the number of concepts to 20.
lsa = make_pipeline(TruncatedSVD(20, random_state=SEED),
                    Normalizer(copy=False))
lsa_tfidf = lsa.fit_transform(tfidf_train)
lsa_tfidf.shape
(1400, 20)
After applying Truncated SVD, our dataset has been transformed from a very high-dimensional space into a much more compact representation:
- 1,400 observations
- 20 latent components (new features)
The original 11,000+ terms have been reduced to just 20 latent components. Each component represents a combination of words capturing underlying patterns in the text. Instead of working with individual words, we are now working with concept-level features.
✅ Why this is powerful
- Reduces noise → removes many low-frequency or uninformative terms
- Improves efficiency → faster model training and lower memory usage
- Helps generalization → reduces overfitting by simplifying the feature space
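To get a feel for what these latent components contain, the sketch below (illustrative, not part of the original flow) pulls the fitted TruncatedSVD step out of the pipeline, reports how much variance the 20 components retain, and lists the stems that load most heavily on the first component.
# Illustrative inspection of the fitted SVD step
svd = lsa.named_steps['truncatedsvd']
print(f"Variance retained by 20 components: {svd.explained_variance_ratio_.sum():.2%}")

terms = count_vect_train.get_feature_names_out()
top_idx = np.argsort(np.abs(svd.components_[0]))[::-1][:10]
print("Top stems in component 0:", [terms[i] for i in top_idx])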
Train model and make predictions
# Full modeling pipeline
text_model = Pipeline([
    ('count_vectorizer', CountVectorizer(tokenizer=StemmingTokenizer(ENGLISH_STOP_WORDS), token_pattern=None)),
    ('tfidf', TfidfTransformer()),
    ('lsa', make_pipeline(
        TruncatedSVD(n_components=20, random_state=SEED),
        Normalizer(copy=False)
    )),
    ('log_reg', LogisticRegression(solver='lbfgs', max_iter=1000, random_state=SEED))
])
# Fit only on training data
text_model.fit(X_train, y_train)
text_model
Pipeline(steps=[('count_vectorizer',
CountVectorizer(token_pattern=None,
tokenizer=<__main__.StemmingTokenizer object at 0x00000215325C4890>)),
('tfidf', TfidfTransformer()),
('lsa',
Pipeline(steps=[('truncatedsvd',
TruncatedSVD(n_components=20,
random_state=42)),
('normalizer', Normalizer(copy=False))])),
('log_reg',
LogisticRegression(max_iter=1000, random_state=42))])
text_pred = text_model.predict(X_test)
title1 = 'Logistic Regression Text Model'
text_metrics = evaluate_model(y_test, text_pred, beta=2, model_name=title1)
text_metrics
| Accuracy | Precision | Recall | F1 | F2 | |
|---|---|---|---|---|---|
| Logistic Regression Text Model | 0.958333 | 0.975945 | 0.940397 | 0.957841 | 0.947298 |
fig, ax = plt.subplots(figsize=(4, 4))
plot_confusion_matrix(text_model, X_test, y_test, ax, title1)
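As a final sanity check, the fitted pipeline can score raw, unseen text directly. The two posts below are made up for illustration (1 = auto-related, 0 = electronics-related).
# Illustrative: classify two made-up posts with the fitted pipeline
new_posts = [
    "My Explorer needs new brake pads and the transmission is slipping.",
    "Looking for a cheap op-amp to build a headphone amplifier circuit.",
]
print(text_model.predict(new_posts))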