We are getting started with machine learning (ML)! Are you excited? I hope so.
Machine learning, at its core, is mainly about finding a good algorithm to learn a mapping between some input and a desired output. However, real-world problems do not start with selecting or even tuning an algorithm; they start with data. Sometimes you do not even have the data yet! And when you do, more often than not it is not ready to be used by an ML algorithm and has to be preprocessed first.
Lucky for you, you will use the tiny and well-prepared Iris data set, so you don't have to deal with the many problems of bad data... at least not at the moment!
The Iris data set
Normally you would first have to load the data set from some source such as a database, a web page, a REST API, or the file system. Because this is the first bite of the ML learning path, and because it is a great opportunity to get started with scikit-learn, one of the main ML libraries for Python, you will use its resourceful `datasets` module. You can directly use `sklearn.datasets.load_iris()`. Please have a look at the API to fully understand how this function works.
Familiarize yourself with the return value of `load_iris()` and try to answer the questions in the code below. Just follow the provided docstring of each function. In this bite you will verify the information about the Iris data set provided by the scikit-learn API. You will learn about ML, supervised and unsupervised learning soon enough, but a good understanding of the data is priceless for any data-driven task.
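To get a first feel for the return value, here is a minimal sketch that calls `load_iris()` with its default parameters and prints the main fields of the returned object. The variable name `iris` is only illustrative; the fields shown (`data`, `target`, `feature_names`, `target_names`) are the ones exposed by scikit-learn's dictionary-like `Bunch` object.

```python
from sklearn.datasets import load_iris

# With default parameters, load_iris() returns a dictionary-like Bunch object.
iris = load_iris()

print(sorted(iris.keys()))   # names of the available fields
print(iris.data.shape)       # (n_samples, n_features)
print(iris.target.shape)     # one integer class label per sample
print(iris.feature_names)    # names of the columns in iris.data
print(iris.target_names)     # class names, indexed by the target value
```

The fields can be accessed either as attributes (`iris.data`) or as dictionary keys (`iris["data"]`).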
Warning: Depending on your current level, you can choose to set the parameter `as_frame=False` (the default) so that `load_iris()` returns a numpy `ndarray` instead of a pandas `DataFrame`. It is up to you to work with the chosen data model and to extract the information you are asked for.
The template provided for you assumes that you set `as_frame=True` and pass the data frame as the default value to the provided functions. Thus, you have to decide what to put into the square brackets for `IRIS_DATA[...]`. If you want to go with numpy, this is of course obsolete.
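The difference between the two data models can be seen in a short sketch like the following. It only inspects the types that come back for each setting of `as_frame`; the names `iris_np` and `iris_df` are made up for this illustration and are not part of the template.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Default (as_frame=False): features and labels are numpy arrays.
iris_np = load_iris()
print(type(iris_np.data))
print(type(iris_np.target))

# as_frame=True: features become a DataFrame and labels a Series.
# An additional "frame" entry holds features and target in one DataFrame.
iris_df = load_iris(as_frame=True)
print(type(iris_df.data))
print(type(iris_df.target))
print(iris_df.frame.columns.tolist())
```

Which of these entries you pass on to the functions is the decision the template leaves to you.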
| Property | Value |
| --- | --- |
| Classes | 3 |
| Samples per class | 50 |
| Samples total | 150 |
| Dimensionality | 4 |
| Features | real, positive |
| Missing attribute values | None |
| Class distribution | 33.3% for each of 3 classes |
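These summary facts can be checked directly against the data. The following is one possible sketch using the numpy data model; the variable names are illustrative, and it is not meant as the reference solution for the bite.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

n_samples, n_features = X.shape
# Unique class labels and how often each one occurs.
classes, counts = np.unique(y, return_counts=True)

print("classes:", len(classes))
print("samples per class:", {str(n): int(c) for n, c in zip(iris.target_names, counts)})
print("samples total:", n_samples)
print("dimensionality:", n_features)
print("missing values:", int(np.isnan(X).sum()))
print("all features positive:", bool((X > 0).all()))
print("class distribution:", np.round(counts / n_samples, 3))
```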
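| Feature | min | max | mean | std | corr |
| --- | --- | --- | --- | --- | --- |
| sepal length | 4.3 | 7.9 | 5.84 | 0.83 | 0.7826 |
| sepal width | 2.0 | 4.4 | 3.05 | 0.43 | -0.4194 |
| petal length | 1.0 | 6.9 | 3.76 | 1.76 | 0.9490 (high!) |
| petal width | 0.1 | 2.5 | 1.20 | 0.76 | 0.9565 (high!) |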
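The per-feature statistics can be recomputed with pandas, for example as sketched below. Two assumptions are made here: the `std` column is taken to be the population standard deviation (`ddof=0`), and the `corr` column is taken to be the Pearson correlation of each feature with the integer class label. Depending on these conventions, some of the computed values may differ from the table in the last digits.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

summary = pd.DataFrame({
    "min": X.min(),
    "max": X.max(),
    "mean": X.mean(),
    # Assumption: population standard deviation (pandas defaults to ddof=1).
    "std": X.std(ddof=0),
    # Assumption: Pearson correlation with the integer class label (0, 1, 2).
    "corr": X.corrwith(y),
})
print(summary.round(4))
```

Comparing the printed values with the table row by row is a quick way to see which entries match exactly and which depend on the chosen convention.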
If you are curious to learn more, feel stuck or want to explore the world of ML a little bit(e) more, have a look at these resources. They are carefully selected to support you on your learning path.
- ML Glossary: Really helpful resource if you want to refresh a topic or the meaning of a term.
- Elements of AI: A wonderful, very high-quality course about Artificial Intelligence (AI) and the role of ML in relation to AI. You can even get a certificate!
- Scikit-learn Cheatsheet: This cheatsheet will help you find the right model for your problem at hand.
- Choosing the Right ML algorithm: This article by Rajat Harlalka is a good, short summary of the most important points when working on an ML problem. It can help you identify areas you want to explore further.