PyBites Bite 346. Getting started with the IRIS data set

You are getting started with machine learning (ML)! Are you excited? I hope so.

Machine learning, at its core, is mainly about finding the right model (or algorithm, if you like) that can learn a mapping between some input and a desired output. However, real-world problems do not start with selecting or fine-tuning an algorithm; they start with the actual data. Sometimes you do not even have the data yet! And when you do, more often than not it is not ready to be used by an ML algorithm and has to be preprocessed first. Preprocessing can mean handling missing data, outliers, or categorical data such as text features.
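To make those three preprocessing chores a bit more concrete, here is a minimal pandas sketch. The toy table, its column names and the outlier threshold of 50 are all made up for illustration; none of this is needed for the Iris data.

```python
import pandas as pd

# Hypothetical messy data: one missing value, one outlier, one text column.
raw = pd.DataFrame(
    {
        "height_cm": [12.0, None, 14.5, 300.0],
        "color": ["blue", "white", "blue", "purple"],
    }
)

clean = raw.copy()

# Missing data: fill the gap with the column median.
clean["height_cm"] = clean["height_cm"].fillna(clean["height_cm"].median())

# Outliers: cap values at an (assumed) plausible maximum of 50 cm.
clean["height_cm"] = clean["height_cm"].clip(upper=50)

# Categorical data: turn the text column into one indicator column per value.
clean = pd.get_dummies(clean, columns=["color"])

print(clean)
```

Which strategy is appropriate (median or mean, capping or dropping, one-hot or ordinal encoding) depends on the data and the model, so treat these choices as placeholders.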

Lucky for you, you will use the tiny and well-prepared Iris data set so you don't have to deal with the many problems of bad data...at least not for now!

The Machine Learning Workflow

I will include this section for all Bites related to ML so that you know where you are on your journey.

There are several possible ways to define an ML workflow, but at its core, you'll often find the following steps:

1. Gathering data

2. Data pre-processing

3. Research the model that will be best for the type of data

4. Training and testing the model

5. Evaluation

Read this article by Ayush Pant to learn more.
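The five steps can be sketched end to end with scikit-learn. This is only an illustration of the workflow, not part of this Bite: the choice of a k-nearest-neighbors classifier, the 70/30 split and the fixed random_state are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Gathering data: scikit-learn ships the Iris data set.
X, y = load_iris(return_X_y=True)

# 2. Data pre-processing: Iris is already clean, so we only hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 3. Choosing a model (an arbitrary pick for this sketch).
model = KNeighborsClassifier(n_neighbors=5)

# 4. Training and testing the model.
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# 5. Evaluation.
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```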

This Bite is about exploratory data analysis. Step one is already done for us by scikit-learn.

You will not begin processing the data yet, though. You will start by gathering some core descriptive statistics to better understand the data.

The Iris data set

The famous Iris data set describes three species of iris flowers (setosa, versicolor and virginica). You can find more information about the data set in the User Guide on scikit-learn.org.

Normally you would first have to load the data set from some source like a database, a webpage, a REST API or the file system. Because this is the first Bite of the ML learning path, and because it is a great opportunity to get started with one of the main ML libraries for Python, scikit-learn, you will use its resourceful datasets module. You can directly use sklearn.datasets.load_iris(). Please have a look at the API documentation to fully understand how this function works.
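As a quick orientation, this is roughly what a plain call to load_iris() gives you. Called without arguments, it returns a dictionary-like Bunch object whose attributes hold the data as NumPy arrays along with some metadata.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement.
print(iris.data.shape)

# Names of the four measured features.
print(iris.feature_names)

# Names of the three species encoded in iris.target as 0, 1 and 2.
print(list(iris.target_names))
```

The Bunch also carries a DESCR attribute with the full text description of the data set, which is worth reading once.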

Your task

Familiarize yourself with the return value of load_iris() and try to answer the questions in the code below. Just follow the provided docstring of each function. In this Bite you will verify the information about the iris data set provided by the scikit-learn API.

Note that the provided code uses load_iris(as_frame=True, return_X_y=True), so the return value is a tuple holding the data as a pandas DataFrame and the target column as a pandas Series. The facts to verify, as listed in the scikit-learn documentation, are:

Classes                    3
Samples per class          50
Samples total              150
Dimensionality             4
Missing Attribute Values   None
Class Distribution         33.3% for each of 3 classes.

Summary Statistics:

Feature        min   max   mean   std    corr
sepal length   4.3   7.9   5.84   0.83    0.7826
sepal width    2.0   4.4   3.05   0.43   -0.4194
petal length   1.0   6.9   3.76   1.76    0.9490 (high!)
petal width    0.1   2.5   1.20   0.76    0.9565 (high!)

Warning: The summary statistics table is taken from scikit-learn's documentation. This is not the layout your function will return. The function returns the summary statistics per feature, so the returned data frame has the statistics as rows and one column per feature.
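To illustrate that orientation, here is a minimal sketch using plain pandas calls. It is not the Bite's reference solution, and the variable names are made up for the example; the docstrings in the provided code remain the authority on what each function must return.

```python
from sklearn.datasets import load_iris

# X is a DataFrame with one column per feature, y is a Series of class labels.
X, y = load_iris(as_frame=True, return_X_y=True)

# Number of samples in each of the three classes.
samples_per_class = y.value_counts()

# Total count of missing values across all features.
missing_values = int(X.isna().sum().sum())

# Statistics as rows, features as columns.
stats = X.agg(["min", "max", "mean", "std"])
print(stats.round(2))
```

DataFrame.describe() produces the same orientation with a few extra rows (count and quartiles), which can be a convenient starting point.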

When you are finished with this Bite, you will have learned how to conduct a simple exploratory data analysis and how to calculate descriptive summary statistics. This is the first, basic step into ML because you need to understand your data. The knowledge you gain here will help you better understand visualizations, conduct outlier analysis and even interpret ML model outputs.

Hints

See the detailed docstrings for each function. Some of them include a Hint note that points to a pandas function that might help with that particular task.

This Bite expects some familiarity with the pandas library. At the very least, the solution will be much simpler if you know the appropriate pandas functions.

Additional information for the curious

If you are curious to learn more, feel stuck or want to explore the world of ML a little bit(e) more, have a look at these resources. They are carefully selected to support you on your learning path.

- ML Glossary -> Really helpful resource if you want to refresh a topic or the meaning of a term.

- Elements of AI -> A wonderful, very high quality course about Artificial Intelligence (AI) and the role of ML in relation to AI. You can even get a certificate!

- Scikit-learn Cheatsheet -> This cheatsheet will help you find the right model for your problem at hand.
