We recently published a blog post about how to approach Cleaning Text as part of a text mining project. In this Bite you are going to get a chance to put the theory into practice.
The challenge is:
1. Read some sample text into a Pandas Dataframe.
2. Clean the text as described below.
3. Calculate the TF-IDF for each word of the cleaned text.
To help you with this Bite you're provided with the function definitions for each of the steps. You are also provided with a Pandas Pipe that combines each of these functions together. Pandas pipes are a great way to create a library of reusable pandas data manipulation functions. A nice little example of how you can use Pandas pipes is given here.
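As a minimal sketch of the idea, `DataFrame.pipe` lets each step be a plain function that takes a dataframe and returns a dataframe. The function names `strip_whitespace` and `add_length` below are hypothetical and are not part of the Bite's template.

```python
import pandas as pd


def strip_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: trim leading and trailing spaces."""
    out = df.copy()
    out["text"] = out["text"].str.strip()
    return out


def add_length(df: pd.DataFrame, col: str = "length") -> pd.DataFrame:
    """Hypothetical step: add a column holding each text's length."""
    out = df.copy()
    out[col] = out["text"].str.len()
    return out


df = pd.DataFrame({"text": ["  hello ", "pandas pipes  "]})

# Each .pipe() call hands the dataframe to the next function;
# extra arguments are forwarded to that function.
result = df.pipe(strip_whitespace).pipe(add_length, col="n_chars")
```

Because every step has the same shape, steps can be reordered or reused in other pipelines without changes.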
You are also provided with a list of stop words to use, and a Pandas implementation of TF-IDF calculation including the completed TF-IDF function.
Most, if not all, of these functions could be combined into one or two larger functions. They are kept separate because the purpose of this Bite is also to help you get familiar with the concepts involved.
Load Sample Text
The first thing you need to do is to load the sample text into a dataframe. If you review the sample text file you can see that the first line just contains a single word: text. Use this as the column name for a single-column dataframe you'll create in this step. The following 20 lines can then be considered as sample documents that you want to calculate the tf-idf values for.
Input: /tmp/samples.txt file (as downloaded in the template for you)
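One way to read such a file is sketched below. The function name `load_text` is an assumption, and the sketch reads the file line by line instead of using a CSV parser, on the assumption that the documents may contain commas. It writes a tiny stand-in file so that it runs on its own; in the Bite you would pass /tmp/samples.txt.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_text(path) -> pd.DataFrame:
    """Use the first line as the column name and every later line as one row."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    header, *rows = lines
    return pd.DataFrame({header.strip(): rows})


# Tiny stand-in for the real sample file (made-up content).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "samples.txt"
    sample.write_text("text\nFirst doc, with a comma\nSecond doc\n", encoding="utf-8")
    df = load_text(sample)
```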
Output: A Pandas dataframe with a single column named text (ignore the index) with 20 rows, where each row contains a line from the imported text file.
Removing URLs and Emails
Strip all URLs (http://...) and email addresses from the text column.
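A minimal sketch of this step using regular expressions follows. The two patterns are deliberately loose assumptions (anything starting with http:// or https://, and any whitespace-free token containing @), and the function name `strip_urls_emails` is hypothetical. The Bite's tests may expect different handling, for example of the whitespace left behind.

```python
import pandas as pd

URL_PATTERN = r"https?://\S+"  # assumed definition of "a URL"
EMAIL_PATTERN = r"\S+@\S+"     # assumed definition of "an email"


def strip_urls_emails(df: pd.DataFrame) -> pd.DataFrame:
    """Remove URL-like and email-like tokens from the text column."""
    out = df.copy()
    out["text"] = (
        out["text"]
        .str.replace(URL_PATTERN, "", regex=True)
        .str.replace(EMAIL_PATTERN, "", regex=True)
    )
    return out


df = pd.DataFrame(
    {"text": ["See http://example.com now", "Mail bob@example.com today"]}
)
cleaned = strip_urls_emails(df)
```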
Input: A dataframe with a single text column that may contain URLs and/or emails
Output: A dataframe with a single text column where all URLs and emails have been removed
Converting to Lowercase
Next, convert all text characters to lower case.
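With pandas this is usually a one-liner on the string accessor. The sketch below assumes the column is named text, and the function name `to_lowercase` is hypothetical.

```python
import pandas as pd


def to_lowercase(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case every value in the text column."""
    out = df.copy()
    out["text"] = out["text"].str.lower()
    return out


df = pd.DataFrame({"text": ["Mixed CASE Text"]})
lowered = to_lowercase(df)
```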
Input: A dataframe with a single text column, potentially of mixed case
Output: A dataframe with a single text column where all alphabetic characters in the text column are lower case
Removing Stop Words
A list of English stop words is provided. In this step all stop words should be removed from the text.
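A minimal sketch of stop word removal is shown below. The short `STOP_WORDS` set is a made-up stand-in for the list provided with the Bite, and the sketch assumes the text is already lower case and that words are separated by whitespace.

```python
import pandas as pd

# Hypothetical stand-in for the stop word list provided with the Bite.
STOP_WORDS = {"the", "is", "a", "of"}


def remove_stopwords(df: pd.DataFrame, stop_words=STOP_WORDS) -> pd.DataFrame:
    """Drop every whitespace-separated token that is a stop word."""
    out = df.copy()
    out["text"] = out["text"].apply(
        lambda s: " ".join(w for w in s.split() if w not in stop_words)
    )
    return out


df = pd.DataFrame({"text": ["the cat is a friend of the dog"]})
no_stops = remove_stopwords(df)
```

Using a set for the stop words keeps each membership check cheap, which matters once the list and the documents grow.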
Input: A dataframe with a single text column that may contain English stop words
Output: A dataframe with a single text column where all stop words have been removed
Removing Non-Ascii Characters
Non-ascii characters are typically removed unless they might be useful for sentiment analysis or similar. For this exercise all non-ascii characters should be removed.
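One possible approach is a regular expression that matches everything outside the 7-bit ASCII range. This is a sketch only; the function name `remove_non_ascii` is an assumption.

```python
import pandas as pd


def remove_non_ascii(df: pd.DataFrame) -> pd.DataFrame:
    """Delete every character whose code point is above 127."""
    out = df.copy()
    out["text"] = out["text"].str.replace(r"[^\x00-\x7F]", "", regex=True)
    return out


# \u00e9 is an accented e, \u2615 is a hot beverage symbol.
df = pd.DataFrame({"text": ["caf\u00e9 \u2615 time"]})
ascii_only = remove_non_ascii(df)
```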
Input: A dataframe with a single text column that may include some non-ascii characters
Output: A dataframe with a single text column where all non-ascii characters are removed
Removing Digits and Punctuation
Remove all digits and punctuation.
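A sketch of this step using a translation table built from the standard library's `string.punctuation` and `string.digits` follows. Treating exactly those ASCII characters as "punctuation and digits" is an assumption, as is the function name `remove_digits_punctuation`.

```python
import string

import pandas as pd

# Maps every ASCII punctuation character and digit to "delete".
DELETE_TABLE = str.maketrans("", "", string.punctuation + string.digits)


def remove_digits_punctuation(df: pd.DataFrame) -> pd.DataFrame:
    """Delete ASCII digits and punctuation from the text column."""
    out = df.copy()
    out["text"] = out["text"].str.translate(DELETE_TABLE)
    return out


df = pd.DataFrame({"text": ["hello, world! 123 times"]})
plain = remove_digits_punctuation(df)
```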
Input: A dataframe with a single text column that may contain some digits or punctuation
Output: A dataframe with a single text column where all digits and punctuation are removed
Calculate TF-IDF
The text is now clean and you are ready to calculate the TF-IDF value for each word remaining in the sample text. This is done for you and returned as a Pandas dataframe. Have a look at the /tmp/tf-idf.py file (also downloaded and imported for you), and see if there are any Pandas tricks you can learn.
Input: A Pandas dataframe with a single text column
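To give a feel for the kind of calculation involved, here is a minimal pandas sketch. It is not the code from /tmp/tf-idf.py. The weighting used here (relative term frequency multiplied by the natural log of the number of documents divided by the number of documents containing the word) is one common variant and is an assumption, so its numbers may differ from the provided implementation. The name `tf_idf_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def tf_idf_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a documents x words table of TF-IDF scores (assumed variant)."""
    # One row per (document, word) occurrence.
    tokens = df["text"].str.split().explode().dropna()
    long = pd.DataFrame({"doc": tokens.index, "word": tokens.to_numpy()})

    # Raw counts: documents as rows, words as columns.
    counts = long.groupby(["doc", "word"]).size().unstack(fill_value=0)

    # Term frequency: share of each word within its document.
    tf = counts.div(counts.sum(axis=1), axis=0)

    # Inverse document frequency: log(N / number of docs containing the word).
    idf = np.log(len(counts) / (counts > 0).sum(axis=0))

    return tf.mul(idf, axis=1)


df = pd.DataFrame({"text": ["cat sat", "cat ran"]})
scores = tf_idf_sketch(df)
```

In this toy input the word cat appears in every document, so its score is zero, while sat and ran each score half of log(2) in the one document that contains them.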
Output: A Pandas dataframe representation of the TF-IDF calculation for each of the words in the text column of the input dataframe. Each word becomes a column and each document becomes a row, so a cell contains the tf-idf score for the given word and document (example).
Sort TF-IDF Dataframe columns
Depending on how the preceding functions were implemented, the order of the words in the TF-IDF dataframe might differ from the order used in the tests. Sort the columns alphabetically to overcome this. Remember, at any time you can review the tests if you are not sure what's expected of a particular function.
Input: The TF-IDF dataframe
Output: The TF-IDF dataframe sorted by column name
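Sorting by column name can be sketched with `DataFrame.sort_index` along the column axis. The function name `sort_columns` and the toy scores are assumptions for illustration only.

```python
import pandas as pd


def sort_columns(tfidf: pd.DataFrame) -> pd.DataFrame:
    """Return the dataframe with its columns in alphabetical order."""
    return tfidf.sort_index(axis=1)


# Made-up scores, only to show the reordering.
tfidf = pd.DataFrame({"zebra": [0.1], "apple": [0.2], "mango": [0.3]})
sorted_tfidf = sort_columns(tfidf)
```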