This challenge write-up first appeared on PyBites.
This week, each one of you has a homework assignment ... - Tyler Durden (Fight Club)
A new week, more coding! In Part 2 of our Twitter data analysis we challenge you to find out how similar two tweeters are ...
Write a script that takes two command-line arguments, user1 and user2:
$ similar_tweeters.py bbelderbos pybites
# ... some index of similarity ...
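One way to read the two handles is with argparse from the standard library. This is only a minimal sketch: the function name `parse_args` and the help texts are our own choices for illustration, not part of the challenge.

```python
import argparse


def parse_args(argv=None):
    """Parse the two Twitter handles (hypothetical helper, not a template)."""
    parser = argparse.ArgumentParser(
        description="Compare how similar two tweeters are."
    )
    parser.add_argument("user1", help="first Twitter handle")
    parser.add_argument("user2", help="second Twitter handle")
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)


# Passing an explicit list mimics: similar_tweeters.py bbelderbos pybites
args = parse_args(["bbelderbos", "pybites"])
print(f"Comparing {args.user1} and {args.user2} ...")
```

Calling `parse_args()` without arguments in the real script picks up whatever was typed on the command line.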
Get the last n tweets of these users. You can use the code of Part 1.
Tokenize the words in the tweets, filtering out stop words, URLs, digits, punctuation, words that occur only once or are shorter than 3 characters (and/or other noise ...)
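The filtering could look roughly like the sketch below, using only the standard library. The tiny `STOP_WORDS` set, the regexes and the `tokenize` name are assumptions made for illustration; a real solution would probably use a fuller stop-word list (for example from an NLP package).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, far from complete)
STOP_WORDS = {"the", "and", "for", "you", "that", "this", "with", "are"}

URL_RE = re.compile(r"https?://\S+")
# Letters only: this drops digits and punctuation, and also splits
# tokens such as "python3" or "don't" at the non-letter character
WORD_RE = re.compile(r"[a-z]+")
MIN_LENGTH = 3


def tokenize(tweets):
    """Return the cleaned-up words of a list of tweet texts."""
    words = []
    for tweet in tweets:
        text = URL_RE.sub(" ", tweet.lower())  # strip URLs first
        words.extend(
            word
            for word in WORD_RE.findall(text)
            if len(word) >= MIN_LENGTH and word not in STOP_WORDS
        )
    counts = Counter(words)
    # Drop words that occur only once across all tweets
    return [word for word in words if counts[word] > 1]


tweets = [
    "Learning Python is fun https://example.com/x 2017!",
    "Python challenges: more Python, more fun",
]
print(tokenize(tweets))
```

Note that the "only once" filter is applied across all tweets of a user, which is one possible reading of the requirement.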
Extract the main subjects the users tweet about. You could use Gensim, an NLP package for topic modeling. However, feel free to take your own approach! We are dropping the helper template and external libs (requirements.txt) for this challenge; we'd love to see different approaches to this problem ...
Compare the subjects and come up with a similarity score.
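As one possible scoring approach (not the only one, and not topic modeling as Gensim would do it), you could compute the cosine similarity between the word counts of both users. The sketch below assumes each user's tweets have already been reduced to a list of tokens; the name `cosine_similarity` is our own.

```python
import math
from collections import Counter


def cosine_similarity(tokens1, tokens2):
    """Return the cosine similarity (0.0 to 1.0) of two token lists."""
    counts1, counts2 = Counter(tokens1), Counter(tokens2)
    # Only words used by both users contribute to the dot product
    shared = set(counts1) & set(counts2)
    dot = sum(counts1[word] * counts2[word] for word in shared)
    norm1 = math.sqrt(sum(count * count for count in counts1.values()))
    norm2 = math.sqrt(sum(count * count for count in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # at least one user has no usable tokens
    return dot / (norm1 * norm2)


score = cosine_similarity(
    ["python", "python", "code"],
    ["python", "code", "code"],
)
print(f"similarity: {score:.2f}")
```

A score close to 1 means the two users use the same words in similar proportions; 0 means they share no words at all.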
Start coding by forking our challenges repo:
$ git clone https://github.com/pybites/challenges
If you already forked it, sync it:
# assuming you use an SSH key
$ git remote add upstream git@github.com:pybites/challenges.git
$ git fetch upstream
# if not on master:
$ git checkout master
$ git merge upstream/master
# ... no helper template for this challenge ...
Remember: there is no best solution, only learning more Python.
Enjoy! We're looking forward to reviewing our solutions and yours on Friday.
Have fun!
More background in our first challenge article.