Added sklearn.md file #721

Merged 3 commits on Jun 1, 2024

1 change: 1 addition & 0 deletions contrib/machine-learning/index.md
@@ -1,5 +1,6 @@
# List of sections

- [Introduction to scikit-learn](sklearn-introduction.md)
- [Binomial Distribution](binomial-distribution.md)
- [Regression in Machine Learning](regression.md)
- [Confusion Matrix](confusion-matrix.md)
144 changes: 144 additions & 0 deletions contrib/machine-learning/sklearn-introduction.md
@@ -0,0 +1,144 @@
# scikit-learn (sklearn) Python Library

## Overview

scikit-learn, also known as sklearn, is a popular open-source Python library that provides simple and efficient tools for data mining and data analysis. It is built on NumPy, SciPy, and matplotlib, and is designed to interoperate with the wider Python numerical and scientific ecosystem.

## Key Features

- **Classification**: Identifying which category an object belongs to. Example algorithms include SVM, nearest neighbors, random forest.
- **Regression**: Predicting a continuous-valued attribute associated with an object. Example algorithms include support vector regression (SVR), ridge regression, Lasso.
- **Clustering**: Automatic grouping of similar objects into sets. Example algorithms include k-means, spectral clustering, mean-shift.
- **Dimensionality Reduction**: Reducing the number of random variables to consider. Example algorithms include PCA, feature selection, non-negative matrix factorization.
- **Model Selection**: Comparing, validating, and choosing parameters and models. Example methods include grid search, cross-validation, metrics.
- **Preprocessing**: Feature extraction and normalization. (A short sketch combining preprocessing and model selection follows this list.)
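
As a quick taste of how these pieces fit together, here is a minimal sketch that chains preprocessing with a classifier and tunes it via grid search. The synthetic dataset, the SVC estimator, and the parameter grid are arbitrary choices for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset purely for demonstration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Chain preprocessing and a classifier, then search over a few hyperparameter values
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
grid = GridSearchCV(pipe, param_grid={"svc__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```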

## When to Use scikit-learn

- **Use scikit-learn if**:
- You are working on machine learning tasks such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
- You need an easy-to-use, well-documented library.
- You require tools that are compatible with NumPy and SciPy.

- **Do not use scikit-learn if**:
- You need to perform deep learning tasks. In such cases, consider using TensorFlow or PyTorch.
- You need out-of-the-box support for large-scale data. scikit-learn is designed to work with in-memory data, so for very large datasets, you might want to consider libraries like Dask-ML.

## Installation

You can install scikit-learn using pip:

```bash
pip install scikit-learn
```

Or via conda:

```bash
conda install scikit-learn
```
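
To confirm the installation, you can print the installed version:

```python
import sklearn

print(sklearn.__version__)
```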

## Basic Usage with Code Snippets

### Importing the Library

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
```

### Loading Data

For illustration, let's create a simple synthetic dataset:

```python
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
```
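
scikit-learn also ships with several small bundled datasets. As an alternative to the synthetic data used in this guide, the classic iris dataset, for example, can be loaded directly:

```python
from sklearn.datasets import load_iris

# 150 samples, 4 numeric features, 3 classes
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
print(X_iris.shape, y_iris.shape)  # (150, 4) (150,)
```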

### Splitting Data

Split the dataset into training and testing sets:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
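
For classification problems with imbalanced classes, you can additionally pass `stratify=y` so that both splits keep the original class proportions (optional here, since the synthetic data is roughly balanced):

```python
# Optional: preserve the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```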

### Preprocessing

Standardize the features so they have zero mean and unit variance. Fit the scaler on the training set only, then apply the same transformation to the test set to avoid data leakage:

```python
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from the training data and scale it
X_test = scaler.transform(X_test)        # apply the same scaling to the test data
```

### Training a Model

Train a Logistic Regression model:

```python
model = LogisticRegression()
model.fit(X_train, y_train)
```
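
After fitting, the learned parameters are available as attributes of the model, which is handy for a quick sanity check:

```python
# One weight per feature for this binary problem, plus a bias term
print(model.coef_.shape)  # (1, 20)
print(model.intercept_)
```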

### Making Predictions

Make predictions on the test set:

```python
y_pred = model.predict(X_test)
```
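
If you need class-membership probabilities rather than hard labels, `LogisticRegression` also provides `predict_proba`:

```python
# Probability of each class for the first few test samples; each row sums to 1
proba = model.predict_proba(X_test)
print(proba[:3])
```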

### Evaluating the Model

Evaluate the accuracy of the model:

```python
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
```
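
Accuracy is only one view of performance. The `sklearn.metrics` module also provides a confusion matrix and a per-class report, for example:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Raw counts of correct and incorrect predictions per class
print(confusion_matrix(y_test, y_pred))

# Precision, recall, and F1-score for each class
print(classification_report(y_test, y_pred))
```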

### Putting It All Together

Here is a complete example from data loading to model evaluation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Preprocess data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
```

## Conclusion

scikit-learn is a powerful and versatile library that can be used for a wide range of machine learning tasks. It is particularly well-suited for beginners due to its easy-to-use interface and extensive documentation. Whether you are working on a simple classification task or a more complex clustering problem, scikit-learn provides the tools you need to build and evaluate your models effectively.