Text Mining with Python Part 2: Improving Your Tokens
Better tokens == better text analytics.
Are you new to the tutorial series? Check out Part 1 here.
In the last tutorial, you learned the first step in text mining - tokenization.
Given that tokens are the foundation for text mining, crafting the best tokens possible is the topic of this week’s newsletter.
As always, it’s highly recommended that you follow along by writing and running the code that you see in this tutorial.
FYI - While I’m using Python in Excel for this tutorial series, the code is 99% the same whether you’re using Microsoft Excel or a technology like Jupyter Notebook.
Case Folding
In text mining, vocabulary is a critical concept. Vocabulary is the set of unique tokens across an entire collection of documents. Just so you know, a document can be any piece of free-form text: a book, a social media post, a Microsoft Word file, etc.
Let’s say you have a collection of customer service chats collected from your website. You can think of each chat as a document. In this case, the vocabulary would be the set of unique tokens across all customer service chats.
A collection of documents is typically called a corpus in text mining, and the plural of corpus is corpora. Now you can drop the lingo. 🤣
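To make these definitions concrete, here is a minimal sketch of a corpus and its vocabulary. The two chat messages are made up for illustration, and `str.split()` stands in for a real tokenizer such as NLTK's `word_tokenize`, so the token boundaries are cruder than what you'll see in the workbook.

```python
# A hypothetical corpus of two documents (made-up customer service chats)
corpus = [
    "my order arrived late",
    "my refund for the late order",
]

# Stand-in tokenizer: split on whitespace (an assumption, not NLTK)
tokenized_docs = [document.split() for document in corpus]

# The vocabulary is the set of unique tokens across ALL documents
vocabulary = set()
for tokens in tokenized_docs:
    vocabulary.update(tokens)

print(sorted(vocabulary))
print(len(vocabulary))
```

Tokens that appear in both chats, such as my, order, and late, are counted only once in the vocabulary.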
NOTE - For brevity, I won’t repeat working with Python in Excel as covered in Part 1.
To really cement this idea, consider the following:
And the code for your workbook:
from nltk.tokenize import word_tokenize
# A corpus with one document
corpus_1 = 'The quick brown fox jumped over the lazy dog.'
# Use the mighty NLTK to tokenize "words"
print(word_tokenize(corpus_1))
If you’re new to Python, my Python in Excel Accelerator online course will teach you the fundamentals you need for analytics in a weekend.
Here’s the thing. Python compares strings in a case-sensitive way, so by default your tokens are case-sensitive too. In the example above, the tokens The and the are considered distinct and will both be added to the vocabulary.
Now, this is clearly not a great outcome because these words aren’t actually distinct in terms of adding meaning to the corpus. So, a common strategy in text mining is to use case folding:
from nltk.tokenize import word_tokenize
# A corpus with one document
corpus_1 = 'The quick brown fox jumped over the lazy dog.'
# Use the mighty NLTK to tokenize "words" with case folding
print(word_tokenize(corpus_1.lower()))
NOTE - Going forward, I’m not going to repeat imports in the Python formulas because subsequent cells “remember” the imports.
Case folding is simply converting all document tokens to lowercase or uppercase. Case folding is not a new idea.
For example, in the US, it’s common to receive mail from utilities or insurance companies that uses all uppercase characters. Ever wondered why? It’s because these organizations often use mainframes and COBOL, and uppercase case folding was the standard.
Looking at the output for cell C3 shows how the tokens have been normalized, so the token the will be added only once to the vocabulary.
While you can choose to case fold to either lowercase or uppercase, it’s most common to use lowercase folding in text mining.
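If you want to see the effect of case folding on the vocabulary in numbers, the following sketch compares the vocabulary size before and after lowercasing. It assumes a plain whitespace split as a stand-in tokenizer (not NLTK), so the final token keeps its trailing period. The last two lines show Python's built-in `str.casefold()`, a more aggressive variant of `lower()` that can matter for non-English text.

```python
# The same sentence used in the tutorial
corpus_1 = 'The quick brown fox jumped over the lazy dog.'

# Stand-in tokenizer: whitespace split (an assumption; the tutorial uses NLTK)
vocab_original = set(corpus_1.split())
vocab_folded = set(corpus_1.lower().split())

print(len(vocab_original))  # 'The' and 'the' are counted separately
print(len(vocab_folded))    # 'the' is counted once

# casefold() goes further than lower() for some characters
print('Straße'.lower())
print('Straße'.casefold())
```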
Your default should be to case fold, but it does come at a price:
# Case folding will lose information
corpus_2 = "I'm traveling to Great Britain."
# Use the mighty NLTK to tokenize "words" with case folding
print(word_tokenize(corpus_2.lower()))
The example above shows that case folding destroys the information that “Great Britain” is a proper noun due to its capitalization.
However, the benefits of case folding typically outweigh the loss of this information, so case folding is a standard step in a text preprocessing pipeline.
If you really need to retain proper nouns, the mighty Natural Language Toolkit (NLTK) does offer support for Named Entity Recognition (NER), but NER is beyond the scope of this tutorial series.
Next up, it’s time to deal with punctuation.
Punctuation
As you will learn in the next tutorial, it’s common to ignore punctuation in real-world text mining because punctuation doesn’t add much information.
Luckily, Python’s built-in string module provides a string of common ASCII punctuation characters we can use to remove punctuation:
import string
# Display punctuation
print(string.punctuation)
And this string can be used to extend the text preprocessing pipeline:
# First and second steps: case folding and tokenization
raw_tokens = word_tokenize(corpus_2.lower())
print(raw_tokens)
# Third step: removing punctuation
punctuation_removed = [token for token in raw_tokens if token not in string.punctuation]
print(punctuation_removed)
The Python formula in C6 illustrates the concept of a text preprocessing pipeline in code. I’m going to extend the code throughout this tutorial so you can see the gradual transformation of the tokens through each step of the pipeline.
The Python formula also demonstrates how useful Python list comprehensions are with NLTK. In this case, the list comprehension only keeps tokens that are not found in the punctuation string. Because string.punctuation is a string, this is a substring check, which behaves as intended for single-character tokens like a period.
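One thing to watch for is punctuation tokens that are longer than one character, such as an ellipsis. The sketch below uses a made-up token list to compare the substring check with a hedged alternative that drops a token only when every character in it is punctuation. Treat it as one possible approach rather than the standard one.

```python
import string

# Hypothetical tokens, including multi-character punctuation tokens
tokens = ['wait', '...', 'really', '?!', 'yes', '.']

# Substring test against the punctuation string
substring_filtered = [t for t in tokens if t not in string.punctuation]

# Alternative: drop a token only if EVERY character is punctuation
punct_chars = set(string.punctuation)
char_filtered = [t for t in tokens if not all(ch in punct_chars for ch in t)]

print(substring_filtered)
print(char_filtered)
```

The substring check keeps '...' and '?!' because neither appears as a run inside string.punctuation, while the character-level check removes them.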
Next up: removing words that don't offer much information.
Stopwords
The English language is filled with words I call “syntactic sugar.” The words are used to make language flow, but rarely (if ever) add information/meaning. The technical term for these words is stopwords, and common English examples include: the, is, an, and as.
Because stopwords offer so little value, a common step in a text preprocessing pipeline is removing them from your tokenized documents. Not surprisingly, the NLTK offers support for stopword removal by providing stopword lists in multiple languages:
from nltk.corpus import stopwords
# List the languages with stopword files
print(stopwords.fileids())
The code above simply shows you which languages have built-in stopword lists from NLTK. There is nothing stopping you from using these lists, altering them, or creating your own.
In this tutorial series, I will use NLTK’s English stopword list, but the concepts apply to any language. Here are the default English stopwords:
# Get the default NLTK English stopwords
stop_words = set(stopwords.words('english'))
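print(stop_words)
The code above demonstrates how to load and display the English stopword list. One thing to note about the Python formula in cell C8 is that I’m using a Python set rather than a list. Membership tests against a set take roughly constant time, which matters because the pipeline checks every token against the stopwords.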
The list of stopwords is quite extensive, and one of the things you must do is validate that, in fact, you want all of these tokens removed from your vocabulary.
For example, if you’re analyzing customer service chats, you might not want to remove words like: who, what, when, where, why, and how.
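One way to do this validation is to count which stopwords actually occur in your corpus before you remove anything. The sketch below assumes a tiny made-up stopword set and two made-up tokenized chats. With real data, you would swap in the NLTK set and your own tokens.

```python
from collections import Counter

# Hypothetical stand-ins: a tiny stopword set and two tokenized chats
candidate_stop_words = {'the', 'is', 'my', 'why', 'where', 'a'}
tokenized_chats = [
    ['where', 'is', 'my', 'order'],
    ['why', 'is', 'the', 'refund', 'late'],
]

# Count how often each candidate stopword actually occurs in the corpus
stopword_counts = Counter(
    token
    for chat in tokenized_chats
    for token in chat
    if token in candidate_stop_words
)

# Review the hits before deciding which ones to keep as stopwords
for token, count in stopword_counts.most_common():
    print(token, count)
```

Seeing where and why in the counts is a prompt to ask whether those tokens carry meaning for your analysis.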
So, you should never use the NLTK stopword lists blindly. You should alter the lists for your particular use case by:
Removing words from the stopwords list.
Adding words to the stopwords list.
The following Python formula demonstrates an easy way to remove words from a stopwords list:
# Define list of words to remove from stopwords list
wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
# Remove words from stopword list
for word in wh_words:
    stop_words.remove(word)
print(stop_words)
As demonstrated by the code in cell C9, the process simply involves creating a list of words you want to remove from the stopwords list and then removing each of the words.
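A small caveat when removing words: a set's `remove()` method raises a `KeyError` if the word isn't there, which can happen if you re-run the cell or list a word the stopword set never contained. The sketch below uses a made-up stand-in for the NLTK set (so it runs without NLTK data) to show the difference between `remove()` and `discard()`.

```python
# Hypothetical stand-in for the NLTK stopword set
stop_words = {'the', 'is', 'who', 'what', 'an'}
wh_words = ['who', 'what', 'whence']  # 'whence' is NOT in the set

# set.remove() raises KeyError for a missing word
try:
    for word in wh_words:
        stop_words.remove(word)
except KeyError as missing:
    print('Not in the stopword set:', missing)

# set.discard() ignores missing words, so the loop is safe to re-run
for word in wh_words:
    stop_words.discard(word)

print(sorted(stop_words))
```

If you expect to re-run your formulas often, `discard()` is the more forgiving choice.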
Adding words to the stopwords list is even easier:
# Define a list of words to add to stopwords list
add_words = ['turbo', 'awesome']
# Add words to stopword list
stop_words.update(add_words)
print(stop_words)
Lastly, integrating stopword removal into the text preprocessing pipeline:
# First and second steps: case folding and tokenization
raw_tokens = word_tokenize(corpus_2.lower())
print(raw_tokens)
# Third step: remove punctuation
punctuation_removed = [token for token in raw_tokens if token not in string.punctuation]
print(punctuation_removed)
# Fourth step: remove stopwords
stopwords_removed = [token for token in punctuation_removed if token not in stop_words]
print(stopwords_removed)
These four steps are part of most text preprocessing pipelines, but there’s more that you can do to improve the quality of your tokens.
For example, combining single tokens to create larger tokens.
N-Grams
So far, you’ve seen how the NLTK makes generating tokens fairly easy. Make no mistake, the NLTK is hiding a lot of smarts behind the scenes, so you’re standing on the shoulders of giants when you use the NLTK to mine your text data.
Also, you’ve seen so far how the NLTK generates single tokens from text. You can also use the NLTK to create sequences of adjacent tokens, and these are known as n-grams. The single tokens you’ve seen so far are called unigrams or 1-grams.
You can also create longer n-grams. The following Python formula incorporates n-grams into the text preprocessing pipeline. Specifically, it generates pairs of adjacent tokens, which are called bigrams or 2-grams:
from nltk.util import ngrams
# First and second steps: case folding and tokenization
raw_tokens = word_tokenize(corpus_2.lower())
# Third step: remove punctuation
punctuation_removed = [token for token in raw_tokens if token not in string.punctuation]
# Fourth step: remove stopwords
stopwords_removed = [token for token in punctuation_removed if token not in stop_words]
# Fifth step: generate n-grams
bigrams = list(ngrams(stopwords_removed, 2))
print(bigrams)
The above code shows the list of bigrams created using the NLTK’s ngrams() function. This function returns a lazy iterator rather than a list, so I commonly convert it to a Python list for ease of coding.
The returned list of bigrams is a list of Python tuples, one for every pair of adjacent unigrams. Tokens that are not next to each other are never paired. As you will see in a later tutorial, each distinct bigram becomes a unique entry in the vocabulary.
Notice that the bigram great britain adds a token embodying the idea of the country (i.e., a proper noun), thereby potentially adding information to the vocabulary compared to just using unigrams.
Like case folding, n-grams are not a perfect solution. However, they perform remarkably well in many real-world text-mining scenarios. So, you should definitely experiment with them in your projects to see if they improve your results.
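If you're curious what generating n-grams involves, here is a minimal pure-Python sketch of the sliding-window idea. The helper name `adjacent_ngrams` is my own invention for illustration, not an NLTK function, and the token list is an assumed example of what the first four pipeline steps might produce.

```python
# Tokens as they might look after the first four pipeline steps (assumed)
tokens = ['traveling', 'great', 'britain']

def adjacent_ngrams(tokens, n):
    """Return the n-grams of a token list as tuples of adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(adjacent_ngrams(tokens, 2))
print(adjacent_ngrams(tokens, 3))
```

Notice that traveling and britain never end up in the same bigram because they aren't neighbors.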
In theory, there’s no limit to the complexity of n-grams you can create. Here are trigrams:
# Get the trigrams
trigrams = list(ngrams(stopwords_removed, 3))
print(trigrams)
The above code demonstrates the pattern you can use to create 4-grams, 5-grams, 6-grams, etc.
However, there’s a tradeoff. As your n-grams get larger, each specific n-gram tends to show up less often in your corpus, while the number of distinct n-grams in your vocabulary tends to grow.
In practice, this means there are diminishing returns for larger n-grams. In my experience, I’ve rarely needed anything beyond trigrams, and usually bigrams are enough.
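You can get a feel for this tradeoff with a quick count. The sketch below uses a made-up token stream and a hypothetical helper (`adjacent_ngrams`, not an NLTK function) to report, for each n, how many distinct n-grams exist and how many of them occur more than once.

```python
from collections import Counter

# A hypothetical token stream (made up for illustration)
tokens = ('great britain is great and great britain is an island '
          'and britain is great').split()

def adjacent_ngrams(tokens, n):
    """Return the n-grams of a token list as tuples of adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# For each n: (number of distinct n-grams, number that occur more than once)
summary = {}
for n in (1, 2, 3):
    counts = Counter(adjacent_ngrams(tokens, n))
    repeated = sum(1 for c in counts.values() if c > 1)
    summary[n] = (len(counts), repeated)
    print(n, len(counts), repeated)
```

On this toy data, the number of distinct n-grams goes up with n while the number that repeat goes down, which is the pattern to expect on real corpora as well.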
So, to summarize what you’ve learned so far in this tutorial series, here are the steps of your text preprocessing pipeline:
Case folding
Tokenization
Punctuation removal
Stopword removal
Generating n-grams
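The five steps above can be wrapped into a single function. This is a sketch under stated assumptions, not the definitive pipeline: `simple_tokenize` is a crude regex stand-in for NLTK's `word_tokenize`, `demo_stop_words` is a made-up stopword set, and the function names are my own. If you have NLTK available, you can pass `word_tokenize` as the `tokenize` argument and your tuned NLTK stopword set as `stop_words`.

```python
import re
import string

def simple_tokenize(text):
    """Stand-in tokenizer: runs of letters/digits/apostrophes, or one symbol."""
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text)

def preprocess(text, stop_words, n=1, tokenize=simple_tokenize):
    """Run the five pipeline steps and return a list of n-gram tuples."""
    # Steps 1 and 2: case folding and tokenization
    tokens = tokenize(text.lower())
    # Step 3: punctuation removal
    tokens = [t for t in tokens if t not in string.punctuation]
    # Step 4: stopword removal
    tokens = [t for t in tokens if t not in stop_words]
    # Step 5: n-grams as tuples of adjacent tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical stopword set for the demo
demo_stop_words = {'to', "i'm"}
print(preprocess("I'm traveling to Great Britain.", demo_stop_words, n=2))
```

With the default `n=1`, the function returns unigrams as one-element tuples, which keeps the output shape consistent across values of n.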
👉 Ready to learn more? Check out Part 3 here.
That’s it for this tutorial.
My next tutorial in this series covers the next stage of text mining - converting your tokens into the bag-of-words (BoW) model.
Stay healthy and happy data sleuthing!