Keyword Extraction with NLP: A Beginner's Guide

Collecting, analyzing, and acting on user feedback is a cornerstone of the user-centered design process. User feedback helps us understand customer needs and levels of satisfaction, and can help us determine where best to focus research and design efforts in order to have the greatest impact on user experience overall.

Thanks to improved tools—and perhaps to increasing pressure to make sure customers feel “heard”—many organizations have gotten pretty good at collecting user feedback. I have not seen evidence, however, that as many are as good at analyzing and acting on that feedback. Organizations may have tons of data—say, 8,000 help tickets and customer comments compiled in a single .csv*—but all too often no one has any idea what insight lies within.

(*True story. More than once, actually. Much more.)

Generating insight from a multi-thousand line spreadsheet of free-form user comments can be tricky. You could look at every comment (if you have the project budget for it … and no regard for your personal sanity), or you could take a sample and hope for the best.

Natural Language Processing to the Rescue

A third approach is to use Natural Language Processing (NLP) to begin to understand the overall tenor of the dataset at a high level, then use that understanding to identify more focused lines of inquiry—either to apply to the data itself or to guide related research.

A wide range of free Python NLP libraries offers some relatively easy-to-deploy tools that can help us uncover key features of large datasets. The tools we’ll explore in this article will let us create lists of single keywords and multi-word phrases (called n-grams) from a large (3,000+ records), unstructured dataset. These lists will help us frame a basic, high-level understanding of the dataset’s contents.
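
If “n-gram” is a new term for you, it just means a sequence of n adjacent words pulled from a text. Here’s a quick, standalone illustration using NLTK and a made-up comment (this snippet isn’t part of the repository):

```python
# A tiny, hypothetical example: extract bigrams and trigrams from one comment.
from nltk import ngrams  # NLTK's n-gram helper

comment = "the financial aid office never returned my call"  # made-up example text
tokens = comment.split()

print(list(ngrams(tokens, 2)))  # bigrams: ('the', 'financial'), ('financial', 'aid'), ...
print(list(ngrams(tokens, 3)))  # trigrams: ('the', 'financial', 'aid'), ...
```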

Previews of top keywords extracted from the sample data set (bar charts)

Though this process, known as “text mining,” is necessarily reductive, it offers several points of entry into a dataset that might otherwise look hopelessly impenetrable.

A Caveat for Less Technical Readers

If the thought of doing anything in code has you reaching for the “close” icon, hear me out first: Up until about a month ago, I didn’t know Python from Boa Constrictor. This article and the accompanying repository are intended to provide a low-barrier entry point into getting started with text analysis and text mining—even if you’re new to manipulating data files with code.

Background, Sources, and Resources

Much of the code you’ll find below (and in the corresponding GitHub repository) has been stitched together from much more knowledgeable primary sources than yours truly—ones that I heartily recommend you check out. The alterations I’ve made have been in the spirit of making these tools more accessible to beginners, of making sure the code works in a standalone repository, and of providing a starting point from which the daring among you can begin to make your own explorations.

Want to read about background and requirements later? Skip to the "How to Extract Keywords with Natural Language Processing" section below.

Sources for the NLP Code Blocks

Python Resources

Running the Repository

The repository for this toolset of operations and functions is stored as a Jupyter Notebook file. Jupyter Notebook is an open source web application that you can use to create and share documents that contain live Python code, equations, visualizations, and text. I’ve found it to be an easy (and forgiving) way to experiment with Python and build NLP operations to accomplish specific tasks.

To run the repository, you will need to set up a few things on your computer: Python 3, Jupyter Notebook, and a handful of Python modules. Jupyter Notebook and all of the modules can be installed with the pip package installer that comes with Python.

Here’s what you’ll want to install for modules:

Once you’re up and running with Python, download the NLP Text Analysis repository from GitHub and copy the sample text files to your desktop. Run “jupyter notebook” from the repository directory, then follow the steps listed below to generate lists and visualizations of normalized keywords and n-grams. Once you get the hang of it, swap in your own massive spreadsheet of unstructured comments and your own custom stop words, and revel in the glory of conducting NLP text analysis all by yourself. 🎉

How to Extract Keywords with Natural Language Processing

1. Load the data set and identify text fields to analyze

Before running anything, drag the “rfi-data.tsv” and “custom-stopwords.txt” files from the repository onto your desktop; that’s where the script will look for them. Then select the first code cell in the “text-analytics.ipynb” notebook and click the “run” button.

Follow the prompts to load your data. After you select your .tsv file, you’ll select the column that contains the data you want to analyze, and then review the most and least common words in the unprocessed text. These will help you identify any custom stop words you may want to add before normalizing the text.
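
The notebook handles all of this interactively, but if you’re curious what’s happening behind the prompts, a minimal sketch of this step might look something like the following (the “Comments” column name is a placeholder; your file’s column will differ):

```python
# A minimal sketch of step 1, assuming a tab-separated file on the desktop
# with a free-text column named "Comments" (a placeholder name).
import os
from collections import Counter

import pandas as pd

data_path = os.path.expanduser("~/Desktop/rfi-data.tsv")
df = pd.read_csv(data_path, sep="\t")

# Pull the free-text field into one big list of lowercased words.
words = " ".join(df["Comments"].dropna().astype(str)).lower().split()

word_counts = Counter(words)
print("Most common:", word_counts.most_common(10))
print("Least common:", word_counts.most_common()[-10:])
```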

2. Create a list of stop words

Stop words are commonly used words such as “the,” “a,” “an,” and “in” that occur with high frequency in natural language, but which don’t communicate salient information about the meaning or subject of a message.

The NLTK module offers a list of the most common stop words in English, which we’ll import here. In this step, we’ll also add in a list of custom stop words based on the text we’re analyzing. You can modify this list in the “custom-stopwords.txt” file on your desktop.

Importing libraries for text preprocessing and stop words (code block in the notebook)

The list of “most frequently occurring words” we generated above offers some good candidates for the custom stop words list. As you start to generate results below, you may want to come back to this step and add more stop words based on your content set to produce more useful results.
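
For reference, here’s a rough sketch of how this step can be assembled outside the notebook (it assumes the custom file lists one term per line, which may differ from the repository’s exact format):

```python
# Combine NLTK's built-in English stop words with custom terms from
# custom-stopwords.txt on the desktop (assumed to be one term per line).
import os

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # only needed the first time you run this

stop_words = set(stopwords.words("english"))

custom_path = os.path.expanduser("~/Desktop/custom-stopwords.txt")
with open(custom_path) as f:
    stop_words.update(line.strip() for line in f if line.strip())

print(f"{len(stop_words)} stop words loaded")
```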

3. Pre-process the data set to get a cleaned, normalized text corpus

Pre-processing includes removing punctuation, tags, and special characters from the text, and then normalizing what’s left into recognizable words. The normalization process includes “stemming,” which strips suffixes and prefixes to leave word roots, and “lemmatization,” which maps those remaining root forms (which may not always be proper words) back to an actual word that occurs in natural language.

Normalizing, stemming, and lemmatizing our corpus (code block in the notebook)

Together, these processes identify a canonical representative for each set of related word forms, which allows us to assess word frequency independent of morphological (word-form) variations.
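
The repository’s version of this step is more thorough, but a simplified sketch of the idea, reusing the hypothetical df["Comments"] column and the stop_words set from the sketches above, might look like this:

```python
# A simplified pre-processing sketch: strip punctuation and special
# characters, drop stop words, then stem and lemmatize what's left.
import re

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download("wordnet")  # data for the lemmatizer; only needed once

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

corpus = []
for comment in df["Comments"].dropna().astype(str):
    text = re.sub(r"[^a-zA-Z\s]", " ", comment.lower())  # keep letters and spaces only
    tokens = [t for t in text.split() if t not in stop_words]
    tokens = [stemmer.stem(t) for t in tokens]           # e.g. "applying" -> "appli"
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # map roots back toward real words
    corpus.append(" ".join(tokens))
```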

4. Extract most frequently occurring keywords and n-grams

Now we’re finally at a stage where we can generate a set of top keywords and n-grams, in our case two- and three-word phrases (bigrams and trigrams). These lists and charts are, of course, only a hint at all of the insight that might be contained in this text corpus, but they provide guidance on where we might need to look more closely or conduct additional research. They also offer a high-level overview that is easily communicated to collaborators and stakeholders.

Top keywords by frequency for 3,892 Shoreline College "Request for Information" fields
Top bigrams by frequency for 3,892 Shoreline College "Request for Information" fields
Top trigrams by frequency for 3,892 Shoreline College "Request for Information" fields

These scripts will also send a .csv of terms and frequency data for each of these views to your desktop for direct access to the compiled results. If you find that some terms are generic to your dataset—in this example “would,” “like,” “need,” and “know” are not really conveying unique meaning—you can add those to your stop words list and run the analysis again.
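
For the curious, here’s a condensed sketch of how frequency lists and charts like these could be produced with scikit-learn and matplotlib, using the corpus list from the pre-processing sketch (the output file names here are made up, and the repository’s code differs in its details):

```python
# Count the most frequent terms and phrases, chart them, and save them to .csv.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer

def top_terms(corpus, ngram_range=(1, 1), n=20):
    """Return the n most frequent terms/phrases for the given n-gram range."""
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    counts = vectorizer.fit_transform(corpus).sum(axis=0)
    freqs = [(term, counts[0, idx]) for term, idx in vectorizer.vocabulary_.items()]
    return sorted(freqs, key=lambda pair: pair[1], reverse=True)[:n]

for label, ngram_range in [("keywords", (1, 1)), ("bigrams", (2, 2)), ("trigrams", (3, 3))]:
    results = pd.DataFrame(top_terms(corpus, ngram_range), columns=["term", "frequency"])
    results.to_csv(f"top-{label}.csv", index=False)  # hypothetical output file name
    results.plot.barh(x="term", y="frequency", title=f"Top {label} by frequency")
    plt.gca().invert_yaxis()  # most frequent term at the top
    plt.tight_layout()
    plt.show()
```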

5. Extract a list of top TF-IDF terms

TF-IDF, short for “Term Frequency–Inverse Document Frequency,” is a numerical statistic that reflects how important a word is to a document in a collection. A term’s TF-IDF value increases proportionally with the number of times the word appears in a document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words simply appear more frequently in general. The result is a list of words ranked by how important they are to the corpus as a whole:

Top TF-IDF terms for 3,892 Shoreline College "Request for Information" fields

That said, I’m not going to claim I’ve yet gotten my head fully around what’s going on in the TF-IDF vectorization process. For a deeper analysis of extracting keywords via TF-IDF, check out Kavita Ganesan’s TF-IDF tutorial.
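
In the meantime, if you just want to see the mechanics, here’s a bare-bones sketch that ranks terms by their summed TF-IDF scores across the whole corpus (one common heuristic; the repository’s approach may differ), again reusing the corpus list from the pre-processing sketch:

```python
# Rank terms by TF-IDF across the cleaned corpus.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Sum each term's TF-IDF score across all documents to get a corpus-wide ranking.
scores = tfidf_matrix.sum(axis=0).A1
ranking = pd.DataFrame(
    {"term": vectorizer.get_feature_names_out(), "tfidf": scores}
).sort_values("tfidf", ascending=False)  # get_feature_names_out needs scikit-learn 1.0+

print(ranking.head(20))
```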

Next Steps

If you’ve dipped your toes into NLP or machine learning before, you know that these waters get deep very quickly. The goal of this article is to provide a “shallow end” that those of us new to NLP can use to get our grounding, and yet still accomplish something useful. It is written from the point of view of a beginner (me!) sharing his foothold on NLP with other beginners (you!)—and hopefully providing you with a valuable entry-level toolset you can bring to your next 8,000-row spreadsheet.

The output of this process is intended to give you a set of data points you can use to better understand the user feedback contained in large, unstructured data sets. It should also help you more easily focus future analysis and research activities. Once you’ve gotten the hang of the basic flow above, start tweaking variables and parameters. When you’re comfortable with the range of what you can do here, you’re ready to explore the many other ways to use these libraries and tools. Please do share what you learn!