05 - Twitter data analysis Part 2: Similar Tweeters

This challenge write-up first appeared on PyBites.

This week, each one of you has a homework assignment ... - Tyler Durden (Fight Club)

Birds of a feather

A new week, more coding! In Part 2 of our Twitter data analysis we challenge you to find out how similar two tweeters are ...

Challenge

Write a script that takes two command-line arguments, user1 and user2:

$ similar_tweeters.py bbelderbos pybites
# ... some index of similarity ...
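As a starting point, the argument handling could be sketched with the standard library's argparse. This is only one possible layout, not a helper template: the function name parse_args and the help texts are our own assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the two Twitter handles (hypothetical helper, not part of the challenge)."""
    parser = argparse.ArgumentParser(
        description="Compare how similar two tweeters are."
    )
    parser.add_argument("user1", help="first Twitter handle")
    parser.add_argument("user2", help="second Twitter handle")
    # With argv=None argparse reads sys.argv[1:], which is what a real script wants.
    return parser.parse_args(argv)


# An explicit list is passed here only to demonstrate the call.
args = parse_args(["bbelderbos", "pybites"])
print(f"Comparing {args.user1} and {args.user2} ...")
```

In the actual script you would call parse_args() without arguments so the handles come from the command line.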

Get the last n tweets of these users. You can use the code of Part 1.
Tokenize the words in the tweets, filtering out stop words, URLs, digits, punctuation, words that occur only once or are shorter than 3 characters (and/or other noise ...)
Extract the main subjects the users tweet about. You could use Gensim, an NLP package for topic modeling. However, feel free to take your own approach! We are dropping the helper template and external libs (requirements.txt) for this challenge because we'd love to see different approaches to this problem ...
Compare the subjects and come up with a similarity score.
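To make the steps above a bit more concrete, here is a minimal standard-library sketch of the tokenizing and scoring steps. It is deliberately not a topic model: it swaps in plain word counts and cosine similarity as a simple stand-in for Gensim. The names tokenize, cosine_similarity and STOP_WORDS, the tiny stop-word list and the thresholds are all assumptions for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real script needs a fuller one.
STOP_WORDS = {"the", "and", "for", "with", "that", "this", "you", "are", "our"}

URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[a-z]+")  # letters only, so digits and punctuation drop out


def tokenize(tweets, min_len=3, min_count=2):
    """Return a Counter of the words that survive the noise filters."""
    counts = Counter()
    for tweet in tweets:
        text = URL_RE.sub(" ", tweet.lower())  # strip URLs before splitting
        for word in WORD_RE.findall(text):
            if len(word) >= min_len and word not in STOP_WORDS:
                counts[word] += 1
    # Drop words that occur fewer than min_count times (once, by default).
    return Counter({w: c for w, c in counts.items() if c >= min_count})


def cosine_similarity(a, b):
    """Cosine similarity between two word-count mappings, between 0 and 1."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no usable words for at least one user
    return dot / (norm_a * norm_b)
```

Feeding each user's tweets through tokenize and passing the two results to cosine_similarity gives one possible similarity index: 0 means no shared vocabulary, values near 1 mean very similar word usage.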

Stay in sync with PyBites challenges repo

Start coding by forking our challenges repo:

$ git clone https://github.com/pybites/challenges

If you already forked it, sync it:

# assuming using ssh key
$ git remote add upstream git@github.com:pybites/challenges.git
$ git fetch upstream
# if not on master: 
$ git checkout master 
$ git merge upstream/master
# ... no helper template for this challenge ...

Good luck!

Remember: there is no best solution, only learning more Python.

Enjoy, and we're looking forward to reviewing our solutions and yours on Friday.

Have fun!

About PyBites Code Challenges

More background in our first challenge article.

Hone Your Python Skills!