{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "0267b5f9", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "from ndslib.config import jupyter_startup\n", "jupyter_startup()" ] }, { "cell_type": "markdown", "id": "a60267cf", "metadata": {}, "source": [ "(scikit-learn)=\n", "# The Scikit-learn package\n", "\n", "Now that we have a grasp on some of the key concepts, we can start *doing* machine learning in Python. In this section, we'll introduce *Scikit-learn* (often abbreviated `sklearn`), which is the primary package we'll be working with throughout this chapter. Scikit-learn is the most widely used machine learning package in Python and, for that matter, probably in any programming language. Its popularity stems from its simple, elegant interface, [stellar documentation](https://scikit-learn.org/stable/documentation.html), and comprehensive support for many of the most widely used machine learning algorithms (the main domain Scikit-learn doesn't cover is deep learning, which we'll discuss separately in {numref}`dl`). Scikit-learn provides well-organized, high-quality tools for virtually all aspects of the typical machine learning workflow, including data loading and preprocessing, feature extraction and selection, dimensionality reduction, model selection, and evaluation. We'll touch on quite a few of these as we go along.\n", "\n", "```{eval-rst}\n", ".. index::\n", " single: Scikit Learn\n", "```\n", "\n", "\n", "## The ABIDE-II dataset\n", "\n", "To illustrate how Scikit-learn works, we're going to need some data. Scikit-learn is built on top of NumPy, so in theory, we could use NumPy's random number generation routines to create suitable arrays for Scikit-learn, just as we did earlier. But that would be kind of boring. We already understand the basics of NumPy arrays at this point, so we can be a bit more ambitious here, and try to learn machine learning using a real neuroimaging dataset. 
For most of this chapter, we'll use a dataset drawn from the *Autism Brain Imaging Data Exchange II* (ABIDE II) project. ABIDE is an international consortium aimed at facilitating the study of autism spectrum disorder (ASD) by publicly releasing large-scale collections of structural neuroimaging data obtained from thousands of participants at dozens of research sites {cite}`di2017enhancing`. In this chapter, we'll use data from the second collection (hence the II in ABIDE II). To keep things simple, we're going to use a lightly preprocessed version of the ABIDE II dataset, provided in {cite}`bethlehem2020normative`.\n", "\n", "```{eval-rst}\n", ".. index::\n", " single: ABIDE-II\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "id": "f1f53736", "metadata": {}, "outputs": [], "source": [ "from ndslib.data import load_data\n", "abide_data = load_data(\"abide2\")" ] }, { "cell_type": "markdown", "id": "63866e7e", "metadata": {}, "source": [ "We can get a quick sense of the dataset's dimensions:" ] }, { "cell_type": "code", "execution_count": 3, "id": "5acc3569", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1004, 1446)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "abide_data.shape" ] }, { "cell_type": "markdown", "id": "4b9355e6", "metadata": {}, "source": [ "And here are the first five rows:" ] }, { "cell_type": "code", "execution_count": 4, "id": "dca5c8db", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | site | subject | age | sex | group | age_resid | fsArea_L_V1_ROI | fsArea_L_MST_ROI | fsArea_L_V6_ROI | fsArea_L_V2_ROI | ... | fsCT_R_p47r_ROI | fsCT_R_TGv_ROI | fsCT_R_MBelt_ROI | fsCT_R_LBelt_ROI | fsCT_R_A4_ROI | fsCT_R_STSva_ROI | fsCT_R_TE1m_ROI | fsCT_R_PI_ROI | fsCT_R_a32pr_ROI | fsCT_R_p24_ROI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABIDEII-KKI_1 | 29293 | 8.893151 | 2.0 | 1.0 | 13.642852 | 2750.0 | 306.0 | 354.0 | 2123.0 | ... | 3.362 | 2.827 | 2.777 | 2.526 | 3.202 | 3.024 | 3.354 | 2.629 | 2.699 | 3.179 |
| 1 | ABIDEII-OHSU_1 | 28997 | 12.000000 | 2.0 | 1.0 | 16.081732 | 2836.0 | 186.0 | 354.0 | 2261.0 | ... | 2.809 | 3.539 | 2.944 | 2.769 | 3.530 | 3.079 | 3.282 | 2.670 | 2.746 | 3.324 |
| 2 | ABIDEII-GU_1 | 28845 | 8.390000 | 1.0 | 2.0 | 12.866264 | 3394.0 | 223.0 | 373.0 | 2827.0 | ... | 2.435 | 3.321 | 2.799 | 2.388 | 3.148 | 3.125 | 3.116 | 2.891 | 2.940 | 3.232 |
| 3 | ABIDEII-NYU_1 | 29210 | 8.300000 | 1.0 | 1.0 | 13.698139 | 3382.0 | 266.0 | 422.0 | 2686.0 | ... | 3.349 | 3.344 | 2.694 | 3.030 | 3.258 | 2.774 | 3.383 | 2.696 | 3.014 | 3.264 |
| 4 | ABIDEII-EMC_1 | 29894 | 7.772758 | 2.0 | 2.0 | 14.772459 | 3080.0 | 161.0 | 346.0 | 2105.0 | ... | 2.428 | 2.940 | 2.809 | 2.607 | 3.430 | 2.752 | 2.645 | 3.111 | 3.219 | 4.128 |

5 rows × 1446 columns

| | site | subject | age | sex | group | age_resid |
|---|---|---|---|---|---|---|
| 0 | ABIDEII-KKI_1 | 29293 | 8.893151 | 2.0 | 1.0 | 13.642852 |
| 1 | ABIDEII-OHSU_1 | 28997 | 12.000000 | 2.0 | 1.0 | 16.081732 |
| 2 | ABIDEII-GU_1 | 28845 | 8.390000 | 1.0 | 2.0 | 12.866264 |
| 3 | ABIDEII-NYU_1 | 29210 | 8.300000 | 1.0 | 1.0 | 13.698139 |
| 4 | ABIDEII-EMC_1 | 29894 | 7.772758 | 2.0 | 2.0 | 14.772459 |
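
The column names in these tables appear to follow a pattern: a handful of identifying and demographic columns (`site`, `subject`, `age`, and so on), followed by many columns whose names start with a prefix such as `fsArea` or `fsCT`. As a minimal sketch, and not a step taken from the dataset's own documentation, the snippet below groups a small hand-copied subset of those names by prefix. The helper `column_family` and the rule "anything starting with `fs` is a brain feature" are assumptions made for illustration only.

```python
from collections import Counter

# A small subset of the 1446 column names, copied from the tables above.
columns = [
    "site", "subject", "age", "sex", "group", "age_resid",
    "fsArea_L_V1_ROI", "fsArea_L_MST_ROI", "fsArea_L_V6_ROI", "fsArea_L_V2_ROI",
    "fsCT_R_p47r_ROI", "fsCT_R_TGv_ROI", "fsCT_R_MBelt_ROI",
]


def column_family(name):
    """Hypothetical helper: label a column by the text before its first underscore.

    Assumption: names starting with "fs" are brain features; everything
    else is lumped together as "other".
    """
    if name.startswith("fs"):
        return name.split("_")[0]
    return "other"


# Count how many columns fall into each family.
family_counts = Counter(column_family(name) for name in columns)

for family, count in sorted(family_counts.items()):
    print(f"{family}: {count}")
```

With the real data, the same helper could be applied to `list(abide_data.columns)` to see how the 1446 columns break down, assuming the naming pattern seen in the first rows holds throughout.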