Bite 452. Getting started with the IRIS data set

We are getting started with machine learning (ML)! Are you excited? I hope so.

Machine learning, at its core, is mainly about finding a good algorithm to learn a mapping between some input and a desired output. However, real-world problems do not start with selecting or tuning an algorithm; they start with data. Sometimes you do not even have the data yet! And when you do, more often than not it is not ready to be used by an ML algorithm and has to be preprocessed first.

Lucky for you, you will use the tiny and well-prepared Iris data set so you don't have to deal with the many problems of bad data...at least not at the moment!

The Iris data set

The famous Iris data set describes three species of iris flowers: setosa, versicolor and virginica. You can find more information about the data set in the User Guide on scikit-learn.org.

Normally you would first have to load the data set from some source such as a database, a web page, a REST API or the file system. Because this is the first bite of the ML learning path, and a great opportunity to get started with scikit-learn, one of the main ML libraries for Python, you will use its datasets module instead. You can call sklearn.datasets.load_iris() directly. Please have a look at the API documentation to fully understand how this function works.
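To get a first feeling for the return value, here is a minimal sketch that loads the data set and looks at the container it comes in. It only assumes that scikit-learn is installed; the variable name iris is chosen for illustration and is not part of the bite's template.

```python
from sklearn.datasets import load_iris

# load_iris() returns a dictionary-like Bunch object
iris = load_iris()

# The Bunch exposes its content both as keys and as attributes
print(sorted(iris.keys()))

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)

# Human-readable names for the columns and for the integer class labels
print(iris.feature_names)
print(list(iris.target_names))

# The full text description, including the statistics quoted in this bite
print(iris.DESCR[:300])
```

Printing iris.DESCR is a quick way to read the data set description without leaving the interpreter.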

Your task

Familiarize yourself with the return value of load_iris() and try to answer the questions in the code below. Just follow the provided docstring of each function. In this bite you will verify the information about the Iris data set provided by the scikit-learn API. You will learn about ML, supervised and unsupervised learning soon enough, but a good understanding of the data is priceless for any data-driven task.

Warning: Depending on your current level, you can choose to set the parameter as_frame to False so that load_iris() returns the data as a numpy ndarray instead of a pandas DataFrame. It is up to you to work with the chosen data model and to extract the information you are asked for.

The template provided for you assumes that you set as_frame=True and pass the data frame as the default value to the provided functions. Thus, you have to decide what to put into the square brackets of IRIS_DATA[...]. If you want to go with numpy, this of course does not apply.
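The difference between the two data models can be sketched as follows. This is only an illustration under the assumption that numpy, pandas and a scikit-learn version supporting as_frame are installed; the names iris_np and iris_df are hypothetical, and which key belongs into IRIS_DATA[...] for each function is still up to you.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Default behaviour (as_frame=False): plain numpy arrays
iris_np = load_iris()
print(type(iris_np.data), type(iris_np.target))

# as_frame=True: pandas objects instead
iris_df = load_iris(as_frame=True)
print(type(iris_df.data), type(iris_df.target))

# With as_frame=True there is also a combined frame:
# the four feature columns plus a "target" column
print(iris_df.frame.columns.tolist())

# Key access and attribute access refer to the same object
same_object = iris_df["frame"] is iris_df.frame
print(same_object)
```

Both variants hold the same numbers; the pandas version just adds column labels and convenience methods on top.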

Classes: 3
Samples per class: 50
Samples total: 150
Dimensionality: 4
Features: real, positive
Missing Attribute Values: None
Class Distribution: 33.3% for each of 3 classes.
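Facts like these can be checked with a few lines of numpy. The following is a hedged sketch of one possible approach, not the solution the bite's template expects; all variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()  # numpy variant (as_frame=False is the default)
X, y = iris.data, iris.target

# Total number of samples and number of features (dimensionality)
n_samples, n_features = X.shape

# Distinct class labels and how often each one occurs
classes, counts = np.unique(y, return_counts=True)

# Share of each class in percent
distribution = counts / n_samples * 100

# Sanity checks on the feature values
has_missing = bool(np.isnan(X).any())
all_positive = bool((X > 0).all())

print(n_samples, n_features)
print(dict(zip(classes.tolist(), counts.tolist())))
print(distribution.round(1))
print(has_missing, all_positive)
```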


Summary Statistics:

Feature         min   max   mean   std    class corr
sepal length    4.3   7.9   5.84   0.83    0.7826
sepal width     2.0   4.4   3.05   0.43   -0.4194
petal length    1.0   6.9   3.76   1.76    0.9490 (high!)
petal width     0.1   2.5   1.20   0.76    0.9565 (high!)
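A table like this can be approximated with pandas. The sketch below rests on two assumptions that are not stated in the bite: that the correlation column is the Pearson correlation between each feature and the integer-encoded class label, and that small deviations are acceptable. pandas computes the sample standard deviation by default and the data shipped with scikit-learn may differ slightly from the data behind the quoted table, so the last digit of some values may not match.

```python
from sklearn.datasets import load_iris

# Combined DataFrame: four feature columns plus the "target" column
frame = load_iris(as_frame=True).frame
features = frame.drop(columns="target")

# One row per feature, one column per statistic
summary = features.agg(["min", "max", "mean", "std"]).T.round(2)

# Assumption: class correlation = Pearson correlation with the class label
summary["class corr"] = features.corrwith(frame["target"]).round(4)

print(summary)
```

The petal measurements correlate strongly with the class label, which is why the table flags them as high.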


Resources

If you are curious to learn more, feel stuck or want to explore the world of ML a little bit(e) more, have a look at these resources. They are carefully selected to support you on your learning path.

- ML Glossary: Really helpful resource if you want to refresh a topic or the meaning of a term.

- Elements of AI: A wonderful, high-quality course about Artificial Intelligence (AI) and the role of ML in relation to AI. You can even get a certificate!

- Scikit-learn Cheatsheet: This cheatsheet will help you find the right model for your problem at hand.

- Choosing the Right ML algorithm: This article by Rajat Harlalka is a good, short summary of the most important points when working on an ML problem. It can help you identify areas you want to explore further.

We use Python 3.8