Added content on Grid Search and Box Plot using Python #702

Closed
wants to merge 16 commits into from
68 changes: 68 additions & 0 deletions contrib/machine-learning/grid-search.md
@@ -0,0 +1,68 @@
# Grid Search

Grid Search is a hyperparameter tuning technique in Machine Learning that helps to find the best combination of hyperparameters for a given model. It works by defining a grid of hyperparameters and then training the model with all the possible combinations of hyperparameters to find the best performing set.
The Grid Search method considers a set of hyperparameter combinations and selects the one that returns the lowest error score. It is especially useful when there are only a few hyperparameters to optimize. However, it is outperformed by other, weighted-random search methods as the Machine Learning model grows in complexity.

## Implementation

Before applying Grid Search to any algorithm, the data is divided into a training set and a validation set; the validation set is used to validate the models. Models with all possible combinations of hyperparameters are trained and evaluated on the validation set to choose the best combination.
Grid Search can be applied to any algorithm whose performance can be improved by tuning its hyperparameters. For example, we can apply it to K-Nearest Neighbors by validating its performance over a set of values of K, or to Logistic Regression by trying a set of learning-rate values to find the one at which it achieves the best accuracy, as sketched below.
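For instance, here is a minimal sketch (an illustrative addition, assuming scikit-learn's `GridSearchCV` and the Iris dataset; the candidate values of K are arbitrary choices) of grid searching K for K-Nearest Neighbors:

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris['data'], iris['target']

# Candidate values of K; arbitrary example choices
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11]}

# GridSearchCV trains and scores a model for every value in the grid,
# using 5-fold cross-validation to pick the best one
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)  # the value of K with the highest mean CV accuracy
print(grid.best_score_)
```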
Let us consider a model that accepts the following three parameters as input:
1. Number of hidden layers [2, 4]
2. Number of neurons in every layer [5, 10]
3. Number of epochs [10, 50]

If we want to try out two options for every parameter (as specified in the square brackets above), grid search has to evaluate 2 × 2 × 2 = 8 different combinations. For instance, one possible combination is [2, 5, 10]. Finding such combinations manually would be a headache.
Now suppose we had ten different parameters and wanted to try out five possible values for each one. That would require manual input from the programmer every time we wanted to alter a parameter value, re-execute the code, and keep a record of the output for each of the 5^10 combinations.
Grid Search automates that process: it accepts the possible values for every parameter, runs the code for each possible combination, records the output of every combination, and reports the combination with the best accuracy.
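As a small illustration of the enumeration Grid Search performs (a hedged sketch, not part of the original tutorial), the snippet below uses Python's `itertools.product` to list every combination of the three example parameters above:

```python
from itertools import product

# The example hyperparameter values from the list above
hidden_layers = [2, 4]
neurons_per_layer = [5, 10]
epochs = [10, 50]

# Enumerate all 2 * 2 * 2 = 8 combinations
for combination in product(hidden_layers, neurons_per_layer, epochs):
    # A grid search would train and score one model per combination here
    print(combination)
```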
In the Logistic Regression example below, the hyperparameter being tuned is the regularization strength C. Higher values of C tell the model that the training data resembles real-world data, so it places greater weight on the training data, while lower values of C do the opposite.

## Explanation of the Code

The code provided performs hyperparameter tuning for a Logistic Regression model using a manual grid search approach. It evaluates the model's performance for different values of the regularization strength hyperparameter C on the Iris dataset.
1. datasets from sklearn is imported to load the Iris dataset.
2. LogisticRegression from sklearn.linear_model is imported to create and fit the logistic regression model.
3. The Iris dataset is loaded, with X containing the features and y containing the target labels.
4. A LogisticRegression model is instantiated with max_iter=10000 to ensure convergence during the fitting process, as the default maximum iterations (100) might not be sufficient.
5. A list of different values for the regularization strength C is defined. The hyperparameter C controls the regularization strength, with smaller values specifying stronger regularization.
6. An empty list scores is initialized to store the model's performance scores for different values of C.
7. A for loop iterates over each value in the C list:
8. logit.set_params(C=choice) sets the C parameter of the logistic regression model to the current value in the loop.
9. logit.fit(X, y) fits the logistic regression model to the entire Iris dataset (this is typically done on training data in a real scenario, not the entire dataset).
10. logit.score(X, y) calculates the accuracy of the fitted model on the dataset and appends this score to the scores list.
11. After the loop, the scores list is printed, showing the accuracy for each value of C.

## Python Code

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Load the Iris dataset
iris = datasets.load_iris()
X = iris['data']
y = iris['target']

# max_iter is raised so the solver converges for every value of C
logit = LogisticRegression(max_iter=10000)

# Candidate values for the regularization strength C
C = [0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2]

scores = []

for choice in C:
    logit.set_params(C=choice)
    logit.fit(X, y)
    scores.append(logit.score(X, y))

print(scores)
```

## Results

```
[0.9666666666666667, 0.9666666666666667, 0.9733333333333334, 0.9733333333333334, 0.98, 0.98, 0.9866666666666667, 0.9866666666666667]
```

We can see that the lowest values of C (0.25 and 0.5) performed worse than the default value of 1. As we increased the value of C towards 1.75, the model's accuracy increased.
Increasing C beyond 1.75 does not appear to improve the accuracy any further.
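As noted in step 9 of the explanation above, the example fits and scores on the full dataset for simplicity. A minimal sketch of the more typical workflow, assuming a held-out validation set created with scikit-learn's `train_test_split` (an adaptation for illustration, not part of the original code), might look like this:

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_val, y_train, y_val = train_test_split(
    iris['data'], iris['target'], test_size=0.3, random_state=42)

logit = LogisticRegression(max_iter=10000)
scores = []

for choice in [0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2]:
    logit.set_params(C=choice)
    logit.fit(X_train, y_train)               # fit only on the training split
    scores.append(logit.score(X_val, y_val))  # score on the held-out split

print(scores)
```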
1 change: 1 addition & 0 deletions contrib/machine-learning/index.md
@@ -9,3 +9,4 @@
- [TensorFlow.md](tensorFlow.md)
- [PyTorch.md](pytorch.md)
- [Types of optimizers](Types_of_optimizers.md)
- [Grid Search](grid-search.md)
104 changes: 104 additions & 0 deletions contrib/plotting-visualization/box-plot.md
@@ -0,0 +1,104 @@
# Box Plot

A box plot represents the distribution of a dataset in a graph. It displays the summary statistics of a dataset, including the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. The box represents the interquartile range (IQR) between the first and third quartiles, while whiskers extend from the box to the minimum and maximum values. Outliers, if present, may be displayed as individual points beyond the whiskers.

For example, imagine you have the exam scores of students from three classes. A box plot is a way to show how these scores are spread out.

## Key Ranges in Data Distribution

A box plot summarizes the data using five key values, listed below (a short numeric example follows this list):
1. Minimum: Q1-1.5*IQR
2. 1st quartile (Q1): 25th percentile
3. Median: 50th percentile
4. 3rd quartile(Q3): 75th percentile
5. Maximum: Q3+1.5*IQR
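
As a quick numeric illustration of these five values (a hedged sketch using NumPy, with made-up sample data), the snippet below computes Q1, the median, Q3, the IQR, and the resulting whisker bounds:

```python
import numpy as np

# Made-up sample of exam scores
data = np.array([52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 72, 79, 95])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr  # "minimum" whisker position
upper_bound = q3 + 1.5 * iqr  # "maximum" whisker position

print(q1, median, q3, iqr, lower_bound, upper_bound)
# Any point outside [lower_bound, upper_bound] (here, the score 95) would be drawn as an outlier
```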

## Purpose of Box Plots

We can create a box plot of the data to determine the following:
1. The number of outliers in a dataset
2. Whether the data is skewed (skewness is a measure of the asymmetry of the distribution)
3. The range of the data

## Creating Box Plots using Matplotlib

Box plots can be created with the built-in `boxplot()` function of matplotlib's `pyplot` module.

Syntax: `matplotlib.pyplot.boxplot(data, notch=None, vert=None, patch_artist=None, widths=None)`

The main parameters are described below; a short example that uses several of them follows the list.

1. data: The data to be plotted, given as an array or a sequence of arrays.
2. notch: Accepts a Boolean value; if True, a notched box is drawn.
3. vert: Accepts a Boolean value; if True, the boxes are drawn vertically, otherwise horizontally.
4. positions: An array that defines the positions of the boxes.
5. widths: An array that defines the widths of the boxes.
6. patch_artist: An optional Boolean value; if True, the boxes are filled patches that can be coloured.
7. labels: Strings that define the label for each dataset (one per box).
8. meanline: An optional Boolean value; if True, the mean is rendered as a line across the box (when means are shown).
9. order: Sets the order of the boxplot.
10. bootstrap: An integer that specifies the number of bootstrap iterations used to compute the confidence interval of the notch.
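
To see a few of these parameters in action, here is a short hedged sketch (an illustrative addition, not one of the original examples) that draws a horizontal, notched, filled box plot:

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
data = np.random.normal(100, 20, 200)

# vert=False draws the box horizontally, notch=True notches it at the median,
# patch_artist=True fills the box, and widths controls the box width
plt.boxplot(data, vert=False, notch=True, patch_artist=True, widths=0.5)
plt.title('Box plot using the notch, vert, patch_artist and widths parameters')
plt.show()
```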

## Implementation of Box Plot in Python

```python
# Import libraries
import matplotlib.pyplot as plt
import numpy as np

# Creating the dataset
np.random.seed(10)
data = np.random.normal(100, 20, 200)
fig = plt.figure(figsize=(10, 7))

# Creating the plot
plt.boxplot(data)

# Showing the plot
plt.show()
```

### Implementation of Multiple Box Plot in Python

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(10)
dataSet1 = np.random.normal(100, 10, 220)
dataSet2 = np.random.normal(80, 20, 200)
dataSet3 = np.random.normal(60, 35, 220)
dataSet4 = np.random.normal(50, 40, 200)
dataSet = [dataSet1, dataSet2, dataSet3, dataSet4]

figure = plt.figure(figsize=(10, 7))
ax = figure.add_axes([0, 0, 1, 1])
bp = ax.boxplot(dataSet)
plt.show()
```

### Implementation of Box Plot with Outliers

A visual representation of the sales distribution for each product; the outliers highlight months with exceptionally high or low sales.

```python
import matplotlib.pyplot as plt

# Data for monthly sales
product_A_sales = [100, 110, 95, 105, 115, 90, 120, 130, 80, 125, 150, 200]
product_B_sales = [90, 105, 100, 98, 102, 105, 110, 95, 112, 88, 115, 250]
product_C_sales = [80, 85, 90, 78, 82, 85, 88, 92, 75, 85, 200, 95]

# Introducing outliers
product_A_sales.extend([300, 80])
product_B_sales.extend([50, 300])
product_C_sales.extend([70, 250])

# Creating a box plot with outliers (sym='o' draws the outliers as circles)
plt.boxplot([product_A_sales, product_B_sales, product_C_sales], sym='o')
plt.title('Monthly Sales Performance by Product with Outliers')
plt.xlabel('Products')
plt.ylabel('Sales')
plt.show()
```

### Implementation of Grouped Box Plot

Used to compare the exam scores of students from three different classes (A, B, and C).

```python
import matplotlib.pyplot as plt

# Exam scores for each class
class_A_scores = [75, 80, 85, 90, 95]
class_B_scores = [70, 75, 80, 85, 90]
class_C_scores = [65, 70, 75, 80, 85]

# Creating a grouped box plot with one box per class
plt.boxplot([class_A_scores, class_B_scores, class_C_scores],
            labels=['Class A', 'Class B', 'Class C'])
plt.title('Exam Scores by Class')
plt.xlabel('Classes')
plt.ylabel('Scores')
plt.show()
```
1 change: 1 addition & 0 deletions contrib/plotting-visualization/index.md
@@ -1,3 +1,4 @@
# List of sections

- [Installing Matplotlib](matplotlib_installation.md)
- [Box Plot](box-plot.md)