Bite 311. Cleaning text with pandas

We recently published a blog post about how to approach Cleaning Text as part of a text mining project. In this Bite you are going to get a chance to put the theory into practice.

The challenge is:

1. Read some sample text into a Pandas Dataframe.

2. Clean the text as described below.

3. Calculate the TF-IDF for each word of the cleaned text.

To help you with this Bite you're provided with the function definitions for each of the steps. You are also provided with a Pandas Pipe that combines each of these functions together. Pandas pipes are a great way to create a library of reusable pandas data manipulation functions. A nice little example of how you can use Pandas pipes is given here.
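As a rough illustration of the pipe idea, here is a minimal sketch. The two step functions (strip_whitespace and to_upper) are made up for this example and are not the functions from the Bite template. Each step takes a dataframe and returns a new one, so the steps can be chained with DataFrame.pipe:

```python
import pandas as pd


def strip_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical step: trim leading/trailing whitespace in the text column.
    return df.assign(text=df["text"].str.strip())


def to_upper(df: pd.DataFrame, column: str = "text") -> pd.DataFrame:
    # Hypothetical step that takes an extra argument, to show how
    # pipe forwards keyword arguments to the function.
    return df.assign(**{column: df[column].str.upper()})


df = pd.DataFrame({"text": ["  hello ", " world"]})

# Each function receives the dataframe returned by the previous one.
result = df.pipe(strip_whitespace).pipe(to_upper, column="text")
print(result)
```

Because every step has the same dataframe-in, dataframe-out shape, the steps can be reordered, tested on their own, or reused in another pipeline.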

You are also provided with a list of stop words to use, and a Pandas implementation of TF-IDF calculation including the completed TF-IDF function.

Most, if not all, of these functions could be combined into one or two larger functions. They are kept separate because this Bite is also meant to help you get familiar with the concepts involved.

Load Sample Text

The first thing you need to do is load the sample text into a dataframe. If you review the sample text file, you can see that the first line contains just the single word text. Use this as the column name for the single-column dataframe you'll create in this step. The following 20 lines can then be treated as the sample documents that you want to calculate the TF-IDF values for.

Input: /tmp/samples.txt file (as downloaded in the template for you)
Output: A Pandas dataframe with a single column named text (ignore the index) with 20 rows, where each row contains one line from the imported text file.
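One possible way to do this, sketched under assumptions: the helper name load_lines is hypothetical, and an in-memory io.StringIO stands in for /tmp/samples.txt so the example runs anywhere. Reading the lines directly, rather than through a CSV parser, avoids splitting documents that happen to contain commas:

```python
import io

import pandas as pd


def load_lines(handle) -> pd.DataFrame:
    # Hypothetical helper: first line is the column name,
    # every following line is one document.
    lines = handle.read().splitlines()
    header, rows = lines[0], lines[1:]
    return pd.DataFrame({header: rows})


# Stand-in for the real file; with the Bite's file you would pass
# an open file handle instead.
sample = io.StringIO("text\nFirst document, with a comma.\nSecond document.\n")
df = load_lines(sample)
print(df)
```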

Removing URLs and Emails

Strip all URLs (http://...) and email addresses ([email protected]) from the text column.

Input: A dataframe with a single text column that may contain URLs and/or email addresses
Output: A dataframe with a single text column where all URLs and email addresses have been removed
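A minimal sketch of this step using regular expressions. The function name and both patterns are assumptions chosen for illustration; they are deliberately simple and may not match exactly what the Bite's tests expect:

```python
import pandas as pd

# Simplified patterns (assumptions): a URL is "http://" or "https://"
# followed by non-space characters; an email is any run of non-space
# characters containing "@".
URL_PATTERN = r"https?://\S+"
EMAIL_PATTERN = r"\S+@\S+"


def strip_urls_emails(df: pd.DataFrame) -> pd.DataFrame:
    cleaned = (
        df["text"]
        .str.replace(URL_PATTERN, "", regex=True)
        .str.replace(EMAIL_PATTERN, "", regex=True)
    )
    return df.assign(text=cleaned)


df = pd.DataFrame(
    {"text": ["See http://example.com/page now", "Mail user@example.com today"]}
)
out = strip_urls_emails(df)
print(out)
```

Note that removing a match leaves the surrounding spaces in place, so the result can contain double spaces.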

Converting to Lowercase

Next, convert all text characters to lower case.

Input: A dataframe with a single text column, potentially of mixed case
Output: A dataframe with a single text column where all alphabetic characters in the text column are lower case
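This step maps directly onto the pandas string accessor. A short sketch, with a hypothetical function name:

```python
import pandas as pd


def to_lowercase(df: pd.DataFrame) -> pd.DataFrame:
    # Series.str.lower lowercases every string in the column;
    # digits and punctuation are left untouched.
    return df.assign(text=df["text"].str.lower())


df = pd.DataFrame({"text": ["Hello World", "MiXeD Case 123"]})
out = to_lowercase(df)
print(out)
```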

Removing Stop Words

A list of English stop words is provided. In this step all stop words should be removed from the text.

Input: A dataframe with a single text column that may contain English stop words
Output: A dataframe with a single text column where all stop words have been removed
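A sketch of one way to filter stop words. The three-word STOP_WORDS set is a tiny stand-in for the list provided with the Bite, and the function name is hypothetical. The sketch assumes the text has already been lowercased and that words are separated by whitespace:

```python
import pandas as pd

# Tiny stand-in for the stop word list provided with the Bite.
STOP_WORDS = {"the", "is", "a"}


def remove_stop_words(df: pd.DataFrame, stop_words) -> pd.DataFrame:
    def drop(text: str) -> str:
        # Split on whitespace, keep only words not in the stop list,
        # then join back with single spaces.
        return " ".join(word for word in text.split() if word not in stop_words)

    return df.assign(text=df["text"].apply(drop))


df = pd.DataFrame({"text": ["the cat is a pet", "a dog barks"]})
out = remove_stop_words(df, STOP_WORDS)
print(out)
```

Using a set for the stop words keeps each membership check fast. Words with punctuation attached (such as "the,") would not match here, so the order of the cleaning steps matters.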

Removing Non-ASCII Characters

Non-ASCII characters are typically removed unless they might be useful for sentiment analysis or similar. For this exercise all non-ASCII characters should be removed.

Input: A dataframe with a single text column that may include some non-ASCII characters
Output: A dataframe with a single text column where all non-ASCII characters are removed
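One way to sketch this is a regular expression that matches anything outside the ASCII range (code points 0 to 127). The function name is hypothetical:

```python
import pandas as pd


def remove_non_ascii(df: pd.DataFrame) -> pd.DataFrame:
    # [^\x00-\x7F] matches any single character outside the ASCII range.
    cleaned = df["text"].str.replace(r"[^\x00-\x7F]", "", regex=True)
    return df.assign(text=cleaned)


df = pd.DataFrame({"text": ["café ☕ time", "plain ascii"]})
out = remove_non_ascii(df)
print(out)
```

Characters are dropped rather than transliterated, so "café" becomes "caf" and not "cafe".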

Removing Digits and Punctuation

Remove all digits and punctuation.

Input: A dataframe with a single text column that may contain some digits or punctuation
Output: A dataframe with a single text column where all digits and punctuation are removed
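A sketch using a translation table built from the standard library's string.digits and string.punctuation. The function name is hypothetical, and string.punctuation only covers ASCII punctuation, which is assumed to be enough once non-ASCII characters have been removed:

```python
import string

import pandas as pd

# Translation table that deletes every ASCII digit and punctuation character.
DELETE_TABLE = str.maketrans("", "", string.digits + string.punctuation)


def remove_digits_punctuation(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(text=df["text"].str.translate(DELETE_TABLE))


df = pd.DataFrame({"text": ["Room 101, please!", "no changes here"]})
out = remove_digits_punctuation(df)
print(out)
```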

Calculate TF-IDF

The text is now clean and you are ready to calculate the TF-IDF value for each word remaining in the sample text. This is done for you and returned in the format of a Pandas Dataframe. Have a look at the /tmp/tf-idf.py file (also downloaded and imported for you), and see if there are any Pandas tricks you can learn.

Input: A Pandas dataframe with a single text column
Output: A Pandas dataframe representation of the TF-IDF calculation for each of the words in the text column of the input dataframe. Each word becomes a column and each document becomes a row, so a cell contains the TF-IDF score for the given word and document.
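To make the output shape concrete, here is a small sketch of a TF-IDF calculation in pandas. It is not the implementation from /tmp/tf-idf.py, and it assumes one common definition: term frequency is a word's count divided by the number of words in the document, and inverse document frequency is log(N / number of documents containing the word). Other variants exist, so scores from this sketch may differ from the ones the Bite's tests expect:

```python
from collections import Counter

import numpy as np
import pandas as pd


def tf_idf_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # Rows are documents, columns are words, cells are raw counts.
    counts = pd.DataFrame([Counter(text.split()) for text in df["text"]]).fillna(0)

    # Term frequency: share of each word within its document.
    tf = counts.div(counts.sum(axis=1), axis=0)

    # Inverse document frequency: log(N / documents containing the word).
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log(len(counts) / doc_freq)

    # Multiplying by a Series aligns its index with the column names.
    return tf * idf


df = pd.DataFrame({"text": ["cat sat", "cat ran"]})
scores = tf_idf_sketch(df)
print(scores)
```

In this toy input "cat" appears in every document, so its IDF is log(1) = 0 and its score is 0 everywhere, while "sat" and "ran" each score 0.5 * log(2) in the one document that contains them. An empty document would produce a division by zero in the term frequency step; the sketch does not handle that case.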

Sort TF-IDF Dataframe columns

Depending on how the preceding functions were implemented, the order of the words in the TF-IDF dataframe might differ from the order used in the tests. Sort the columns alphabetically to overcome this. Remember, at any time you can review the tests if you are not sure what's expected of a particular function.

Input: The TF-IDF dataframe
Output: The TF-IDF dataframe sorted by column name
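Sorting by column name can be sketched with DataFrame.sort_index along the column axis. The function name and the toy scores below are made up for illustration:

```python
import pandas as pd


def sort_columns(df: pd.DataFrame) -> pd.DataFrame:
    # axis=1 sorts the column labels instead of the row index.
    return df.sort_index(axis=1)


df = pd.DataFrame({"zebra": [0.1, 0.0], "apple": [0.0, 0.3], "mango": [0.2, 0.2]})
out = sort_columns(df)
print(out)
```

The values stay attached to their columns; only the column order changes.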


We use Python 3.8