0% found this document useful (0 votes)
20 views101 pages

[Technical] Machine Learning U3-6 [2019 Pattern]

The document covers supervised learning with a focus on regression techniques, including bias, variance, underfitting, and overfitting. It discusses various regression methods such as linear regression, Lasso regression, and Ridge regression, along with evaluation metrics like MAE, RMSE, and R2. Additionally, it highlights the importance of model complexity and generalization in achieving accurate predictions in machine learning.

Uploaded by

Rutik Kohakade
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
0% found this document useful (0 votes)
20 views101 pages

[Technical] Machine Learning U3-6 [2019 Pattern]

The document covers supervised learning with a focus on regression techniques, including bias, variance, underfitting, and overfitting. It discusses various regression methods such as linear regression, Lasso regression, and Ridge regression, along with evaluation metrics like MAE, RMSE, and R2. Additionally, it highlights the importance of model complexity and generalization in achieving accurate predictions in machine learning.

Uploaded by

Rutik Kohakade
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF or read online on Scribd
You are on page 1/ 101
‘Unit Il Supervised Learning : Regression Syllabus Bias, Variance, Generalization, Underfiting, Overfitting, Linear regression, Regression: Lasso regression, Ridge regression, Gradiem descent algorithm. Evaluation Metrics : MAE, RMSE, R2 Contents 3.1 Blas 3.2 Varlance 3.3 Underhting, Overfiting «..2cecececesevee, Maye22, ++ te Marks 6 3.4. Regression cae «» March-19,20, May-22, ©“ Marks 10 3.5 Gradient Descent Algorithm 3.6 Evaluation Metrics eH TE * In popular, a device studying, model analyses the information, find patterns jn y ‘and make predictions, While education, the model learns these patterns inside iy, dataset and applies them to test’ information for prediction. While Predictions, a distinction takes place among prediction values made by meang of the model and real values/expected values, and this difference is known as Diag mistakes or Errors due to bias, It can be defined as an inability of system algorithms such as Linear Regression to seize the real relationship between the Statistics factors, Each algorithm starts off evolved with some quantity of bigs because bias occurs from assumptions in the version, which makes the target feature easy to examine. A model has both ; 1) Low bias : A low bias version will make fewer assumptions about the form of the goal function, 2) High bias : A model with a excessive bias makes more assumptions, and the version becomes unable to seize the crucial features of our dataset. A high bias model also cannot carry out properly on new information. Generally, a linear algorithm has a high bias, because it makes them learn fast. The easier the algorithm, the better the unfairness it has possibly to be brought Whereas a nonlinear set of rules regularly has low bias. Some examples of device gaining knowledge of algorithms with low bias are Decision Trees, k-Nearest neighbors and Support vector machines. At the same time, an set of rules with excessive bias is Linear regression, Linear discriminant Analysis and Logistic regression, Ways to Reduce High Bias ‘High bias mainly occurs due to a far simple version. Below are a few approaches to lessen the excessive bias : © Increase the input features as the model is underfitted. © Decrease the regularization time period, © Use more complicated models, which included some polynomial features, The variance could specify the quantity of version within the prediction if the special education facts was used. In simple phrases, variance tells that how a good deal a random variable isn't the same as its expected value. Ideally, a model need to not range an excessive amount of from one training dataset to some othet, because of this the algorithm need to be correct in know-how the hidden mapping between inputs and output variables. Variance mistakes are either of low variance or excessive variance, TECHNICAL PUBLICATIONS® - an up-thust for knowledge eS Stent ow variance manner there's a small version inside the prediction of the goal characteristic with changes in the education facts set. At the identical time, High ‘variance shows a huge version in the prediction of the target feature with changes jn the education dataset. i : ‘A model that indicates high variance learns lots and perform properly with the schooling dataset, and does no longer generalize well with the unseen dataset. As a end result, one: of these model gives appropriate results with the schooling dataset but suggests excessive errors prices at the test dataset. since, with high variance, the version learns an excessive amount of from the dataset, it results in overfitting of the version. A model with excessive variance has the underneath issues : oA excessive variance version results in overfitting. © Increase version complexities. «Usually, nonlinear, algorithms have lots of flexibility to fit the version, have high variance. L Explain bias and its types 2 What is variance ? Variance * The variance would specify the amount of variation in the prediction if the one-of-a-kind training information changed into used. In easy words, variance tells that how plenty a random variable isn't the same as its predicted fee. Ideally, a version ought to not range an excessive amount of from one schooling dataset to any other, which means the set of rules should be right in understanding the hidden mapping among inputs and output variables. Variance errors are both of low variance or high variance. Low variance method there is a small variant in the prediction of the goal characteristic with changes in the training information set. At the equal time, High variance suggests a large variation within the prediction of the target function with modifications within the education dataset. ‘A model that suggests excessive variance learns plenty and carry out well with the schooling dataset, and does not generalize well with the unseen dataset. AS a result, such a version offers correct results with the schooling dataset however suggests excessive blunders quotes at the check dataset. TECHNICAL ‘puaLicaTIONS® = an up-thrust for knowledge ‘Supervised Leaming 3-4 Fearessn ‘Machine Learning the version learns too much from the dataset, it * Since, with excessive variance, it ive varian ends in overfitting of the version. A version with excessi ce has the underneath problems : ; © A high variance model ends in overfitting. © Increase model complexities. a s ; + Usually, nonlinear algorithms have quite a few flexibility to fit the version, hay, high variance. SURE Underfitting, Overfitting Overfitting and Underfitting are the 2 foremost troubles that arise in gadge, mastering and degrade the overall performance of the device studying models, The principal goal of every device getting to know model is to generalize properly. Here generalization defines the capability of an ML model to offer a appropriate output with the aid of adapting the given set of unknown enter, It manner after supplying training at the dataset, it may produce dependable ang correct output. Hence, the underfitting and overfitting are the two phrases that want fo be checked for the overall performance of the model and whether or not the version is generalizing properly or not. Before knowledge the overfitting and underfitting, let's recognize some simple term so that it will assist to understand this subject matter nicely. Signal : It refers back to the actual underlying sample of the statistics that facilitates the machine learning model to examine from the facts. Noise : Noise makes no sense and beside the point statistics that reduces the performance of the version. P * Bias : Bias is a prediction errors that is introduced in the version due to oversimplifying the system learning algorithms. Or it is the difference between the predicted values and the real values Variance : If the system getting to know version performs well with the education dataset, but does no longer perform properly with the test dataset, then variance takes place. Overfitting * Overftting happens when our system leaning version tries to cowl all the statistics points or more than the epecified information factors gift within the given dataset. Because of this, the model begins caching noise and inaccurate values gift ‘side the dataset, and some of these elements lessen the performance and accuracy of the model. The overfitted version has low bias and high variance. TECHNICAL PUBLICATIONS® - an upthust fr knowledgo | overfitting is the main hassle that happens in st Example : The concept of the Overfitting may beneath graph ofthe linear regression output « upervised studying. 'y be understood by means of the Fig. 3.3.1 Overfitting + As we are able to see from the above graph, the model attempts to cover all of the information factors gift inside the scatter plot. It may also appearance efficient, however in truth, it is not so. Because the purpose of the regression version to locate the first-class fit line, but right here we have got no longer got any exceptional healthy, so, it will generate the prediction errors. FERED How to Avoid Overfitting In Model + Both overfitting and underfitting purpose the degraded overall performance of the device mastering version. But the principle reason is overfitting, so there are some ways via which we can lessen the occurrence of overfitting in our model. © Cross-Validation © Training with more data @ Removing features © Early stopping the training © Regularization ° Ensembling TECHNICAL PUBLICATIONS® - an upthnst for knowledge Scanned with CamScanner | ee J Supervised Leaning. 3-6 Sees Se EEE Underfitting : our mate goining knowledge of version place 1 ae hion of the data, To avoid the Overttng ay be stopped at an early legrae) "%, + Underfitting takes seize ae ling = cough from the training records ae capable of seize te UN the model, the fed of ee ae Rane nol se not rest the fine healthy of the dominant fashion Withiy i. result, it can fal statistics, ion ist always capable of analyze g the versi In the case of underfiting, the it reduces the accuracy fe aes eee 7 ele bias and occasional variance. nderftted version has excessiv u f be tand the underftting the use of beneat ; oe We can unders linear regression version ? Fig, 3.3.2 Underfitting ‘As we are able to see from the above diagram, the model is not able to capture the records factors present in the plot . | BT How to Avoid Underitting | By growing the education time of the model. | | | . | : © By increasing the wide variety of functions. Goodness of Fit + The "Goodness of healthy” time period is taken from the facts, and the aim of te machine mastering fashions to reap the goodness of in shape. In informatin modeling, it defines how carefully the end result or expected valu true values of the dataset. one TECHNICAL PUBLICATIONS®. an up-thrust for knowledge po model with an amazin pe A ideally, Lt naben radia At ts between the undertited and overfitied. model ani ctions with zero vert l to acquire it. ‘mistakes, but in practice, it's far tough ‘As whilst we educate our version § for a time, the facts move down, and the identical fy ; the mistakes inside the education version fe aie duration, then the ree fe Bat i we tech the pecause of the overfiting, as the ye version may also decrease sion al dataset. The mistakes in the check dataset stat newness noe Present in the than the elevating of errors, easing, so the factor, just earlier is th pecomplishing an awesome version, COS EtOH, and we will stop here for « There are other strategies by usin Prato mcdall whilst aes 6 Which we are able to get an amazing factor ampli ‘ and validation dataset pling technique to estimate model accuracy pe Difference between Overfitting and Underfitting | sr. No Parameter Overfitting > SoG E _ Complexity I Woolduplerne Mag Misr eS te Reason Low bias, High variance ‘ eae Quantity of features Smaller quanuty of features, A larger quantity of feature. | 4 Regullarization More regularization Less regularization 1. What is overfitting and underftting in machine learning model ? Explain with example. Se 2, Difference between overftting and underiting. EA Regression «Regression is a technique for expertise the relationship between unbiased variables or functions and a structured variable or outcome. Outcomes can then be anticipated as soon as the connection among impartial and based variables has been estimated. Regression is a field of have a look at in data which paperwork a key a part of forecast models in system learning, It's used as an technique to predict non-stop effects in predictive modelling, so has application in forecasting, and predicting consequences from records. Machine learning regression generally TECHNICAL PUBLICATIONS® - an up-thrust for knowledge Scanned with CamScanner Stpented Leambg py, ‘Machine Learning the records factors. The distance 4, a eae line is minimized £0 achieve the best suit line. is + classificati jon is one of the main applicatio, . ees at age gai knowledge of. Clssifeton . cat eet primarily ‘based on Tearned capabilities, while iepceaal of continuous results. Both are predictive modelling ting of is imperative as an method in ke ‘ Line of nice fit via Se aaige Peas ae Tadkone depend on labelled enter ang aye and output of the training records sho. on. The functions Iabelled go the version can apprehend tne relationship. Regression evalua used to apprehend the relationship between special independent variables : ‘Models which might be educated to fo dependent variable or final results. tito trends and outcomes will be skilled the use of regression strata ‘These models will examine the relationship between enter and output informe It can then forecast future developments from labelled education statistics expect consequences from unseen input data, or be used to apprehend gqp, ancient records. © As with all supervised gadget mastering, unique care must be taken to ensure, labelled schooling records is representative of the general populace. If , education information isnt consultant, the predictive vers'ch will be overtit records that doesn't represent new ‘and unseen statistics. This will bring te inaccurate predictions once the model is deployed. Because regression evalua entails the relationships of capabilities and outcomes, care must be taken include the right choice of features t00. ) EZR] Linear Regression | predictive evaluation. Linear regressig © Ibis a statistical technique that is used for makes predictions for continuous/real or numeric variables which include inom a revenue, age, product rate, and so forth. Linear regression algorithm suggests a linear courting among structured (y) and cone or extra independent (y) variables, hence called as linear regression. Sin linear regression suggests the Jinear dating, this means that it finds how the fe of the dependent variable is converting in step with the value of the unbiased varlableThe linear regression model gives a sloped straight line representing te relationship between the variables. ' TECHNICAL PUBLICATIONS® - an upthnst for knowledge Logistic regression is dear up the class ee hes Supervised. studyin i Set of rules that’s used to | discrete layout which includes st . Out Which inches ane ue have based variables in a binary ~ logistic regression set of rules works i the expli ‘ with the explicit variable tog } ther wit Sues | ‘or no longer unsolicited With zero of 1 mail, and many o 4 Yes or No, True or False, Spam “8, Its a predictive evaluation . alpen orithm which fi i ressic dy | + Lai een es eof god fan tor complicated price characteristic, This sipmoin tion or logistic feature that is a las information in logistic regression Thee 4s used to version the a oo = — tic can be represented as os lte™ aa | £X) = Output between the 0 and 1 va value. ” x = input to the function y e = base of natural logarithm to « When we provide the input valu : Picts eS (data) to the function, it gives the S-curve as a 0 sat Fig, 3.4.1 Logistic regression It makes use of the concept of threshold levels, values above the edge degree are at least one and values below the threshold level are rounded as rounded up to much as zero. TECHNICAL PUBLICATIONS? - ‘an up-thrust for knowledge el Scanned with CamScanner Supervised Leaming Regros, Machine Leaming + There are three types of logistic regression * © Binary(0/1, pass/fail) © Multi(cats, dogs, lions) Ordinal (low, medium, high) [EZE] Lasso Regression Lasso regression is any other the version It is similar to the ridge regression except the absolute weights instead of a rectangular of weights Since it takes absolute values, consequently, it is able to while ridge regression can simplest cut back it close to zero: It is likewise known as aé L1 regularization. The equation for Lasso regressi ion technique to lessen the complexi regularization techniq ity that penalty time period includes shrink the slope to may bes = giewind 222 DE |wl) Le, y) = Min (Davi wi xi)? A Dil al EZQZzJ scikit Learn Code for Lasso Regression {trom ssklearmt import linear_ model model Lasso(alpha ZEIT Ridge Regression Ridge regression is one of the maximum sturdy versions of linear regressi wherein a small quantity of bias is delivered so that we will get higher long ter predictions. The quantity of bias added to the version is called ridge regression penalty. can compute this penalty term with the aid of multiplying with the lambda to t squared weight of each individual features. ‘The equation for ridge regression might be : Us, y) = Min(D8 (yi-wi ni)? +2 Dp i)*) A standard linear or polynomial regression will fail if there may be higl collinearity among the unbiased variables, to be able to solve such problems, ridg regression may be used. Ridge regression is a regularization method, which is used to reduce thé complexity of the version. It is likewise referred to as L2 regularization. It helps to clear up the troubles if we have greater parameters than samples. TECHNICAL PuBLIcATIONs® - an up-thrust for knowledge 5 ing ~ | oH Sal p Sug scikit Learn Code for Ridge Regression ypervised Leaming - Regression aeeinanr masa import Ridge) as By xt UP 9A gs, n features = 24, 19 ple 9 $0 psancdom RandomStat(0) | 98 ado(a_ sample) | 92 Fograndn(o_ samples, features) of | s udge(alpha = 0.5) 59) 1g 10%» cro 60) Foe Write short note on types of regression, ‘ SES 5 What do you mean by linear regression 2 on? Which epics lations are best modeled by linear regression ? J 4, pla in deta ridge nd as rps, 4. What do you me by logistic regression? Explain say ._ Eo 5, How ridge regression Ielp for reguarizing linear models 2 Write aan aT —— | BB Gradient Descent Algorithm 4 Gradient Descent is an optimization algorithm in pe sit, the aid gadget mastering used to limit a m if iteratively moving towards the minimal fee of the characteristic e |» We essentially use this Cost function algorithm when we have to Jocate the least possible values which could fulfill a given fee function. In 1 gadget getting to know, yreater regularly that not we try to limit loss features like Mean Squared Error). By minimizing the loss ‘Minimum characteristic, we will Fig. 3.5.1 Gradient descent algorithm TECHNICAL PUBLICATIONS® - an up-thust for knowfedge Scanned with CamScanner Machine Lo 3-12 Supervised Lea! eet Se improve our model and Gradient Descent is one of the most Populs used for this cause. The graph above shows how exactly a Gradient a ls We first take a factor in the value function and begin shifting in direction of the minimum factor. The size of that step, or how quickly converge to the minimum factor is defined by Learning Rate. We cat location with better learning fee but at the risk of overshooting the the opposite hand, small steps/smaller gaining knowledge of charge number of time to attain the lowest point. Now, the direction wherein algorithm has to transport (closer to min important. We calculate this by way of using derivatives. You need tc with derivatives from calculus. A spinoff is largely calculated because the graph at any specific factor. We get that with the aid of finding line to the graph at that point. The extra steep the tangent, would s more steps would be needed to reach minimum point, much less : ‘suggest lesser steps are required to reach the minimum factor. Goer 1, Explain gradient descent algorithm with example. 2. Features of gradient descent algorithn. Evaluation Metrics Descent set of rules w * The essential step in any gadget mastering model is to evaluate the « the version. The Mean squared error, Mean absolute mistakes, Root Mec Error, and R-Squared or Coefficient of determination metrics are used to performance of the model in regression analysis * The Mean absolute mistakes represents the average of absolutely the among the real and expected values inside the dataset. It measures the ; the residuals within the dataset. * Mean Squared Error represents the average of the squared difference be original and expected values in the facts set. It measures the varian residuals. Root Mean Squared Error is the square root of mean squared blu measures the usual deviation of residuals. * The coefficient of determination or R-squared represents the percentag variance in the established variable that's defined via the linear regressio, TECHNICAL PUBLICATIONS® - an up-inus for knowledge we ———_—_ ——= 9! | gre tearing ees SO Sti, “18 as eee € algorithg, quis a sealeT003 rating Le. No mate the alg square will be less than one, 'les being small or large, the price of ret ,_ adjusted R squared is modified very, ee wide variety of independent variables Ree Square and it is adjusted for the 2; anne uch less than or equal to R ree eae and it's going to always be fools observations within the records and gf as below nis the variety of minima, is yariables in the information, wide variety of the impartial 5 will eat 3. go Differences Among Those Evaluation Metrics Mean Squared Error(MSE) and Roop mal) is alsy | = Mean §; "be famiige | prediction mistakes vinsvis Mean apsjeg Ste Eror penalizes the large the slope widely used than MSE to assess the overall ae (MAE). However, RMSE is the ane with other random fashions because it has the oe e the regression model Ms axis). 2 me gadgets as the struc uaggest that variable (Y-axis). gadgets as the structured : a differentiable featy; i teep might | , MSE is a di ature that makes it eas pe ra i eval aad it easy to camry out mathematical in lots of fashions, RMSE is used as a default monic regardless of being harder to interpret than MAR “ntltins Loss Fanton —— |, Thedectease fee of MAE, MSE, and RMSE imp iplies higher accura version. However, a better cost of R rectangular ie as aoe aa + R Squared and Adjusted R Squared are used f and ie d fot explaining how well the impartial variables within the linear regression model explaing the range in the dependent variable. R Squared value usually will increase with the addition of the Sr ct | independent variables which may lead to the addition ofthe wencniny voraten Sus a in our model. However, the adjusted R-squared solves this hassle. e | | + Adjusted R squared takes into account the variety of predictor variables, and it's miles used to decide the range of independent variables in our model. "The value difference i R dedi eee of Adjusted R squared decreases if the growth within the R square by the a additional variable isn’t widespread sufficient. ve + For comparing the accuracy among distinct linear regression fashions, RMSE is a a f ‘a higher choice than R Squared. + MAE (Mean Absolute Mistakes) represents the distinction among the unique and cn predicted values extracted via averaged the absolute difference over the facts set. : + (Mean Squared Error) represents the difference between the authentic and rites expected values extracted by squared the average distinction over the data set. e of the mamradel + RMSE (Root Mean Squared Error) is the mistake charge by way of the square root of MSE. TECHNICAL PUBLICATIONS® - an up-thrust for knowledge Scanned with CamScanner Classification : K-nearest neighbour, Su Boosting, Random Forest, Adaboos, Boeryre-Mticlass Classification, a problems, Variants of MuliclassClasiRewan One rnanced Multiclass Classification Evaluation a ind Score : Accuracy, pet Micro-Average Precision and Recall, Micro-dverage Frygh on, FHeor®_Cross-validation, Recall, Macro-Average F-score. rage F-score, Macro-Average Precision and ‘and Imbal [ Contents | 44 K Nearest Neighbour 42. Support Vector Machine Algorithm. . z May-10) 22, orn eee rast ) 43. Ensemble Leaming : Bagging, Boosting, Random Forest, Adaboost | | yi oe ove May-22, Ss saTeniwe ects tars 8 44 Binary-vs-Multiclass Classification, Balanced and Imbalanced Multiclass Classification 45. Variants of Mullciass Classification :One-vs-One and One-vs-Al 46 Evaluation Metrics and Score | 47 Micro-average Method 4.8 Macro-average 49 Cross Validation i Co) canned witn Lamscanner Machine Leeming eee K Nearest Neighbour | + KNearest Neighbour is one of the only Machine Learning slgorithms ba ‘on supervised learning approach. . KANN algorithm assumes the similarity between the brand new case available instances and placed the brand new case into the catego maximum similar to the to be had classes. | . KCN set of rules shops all of the to be had facts and classifies a new point based at the similarity. This means when new data seems then j effortlessly categorised into a properly suite class by using K-NN algori «© KANN eet of rule can be used for regression as well as for classification normally it's miles used for the classification troubles. + K-NN is a non-parametric algorithm, because of this it does no longer m assumption on underlying data. ~ It is also referred to as a lazy leamer set of rulcs because it does n research from the training set immediately a5 2 substirute it shops the dat at the time of class, it plays an movement at the dataset * The KNN set of rules at the schooling section simply stores the dataset an it gets new data, then it classifies that statistics into a class that is an a ‘similar to the brand new data. + Example : Suppose, we've an picture of a creature that looks much like dog, but we want to know both itis a cat or dog. So for this identity, we to use the KNN algorithm, because it works on a similarity degree. Oui version will discover the similar features of the new facts set to the cats an snap shots and primarily based on the most similar functions it will pl both cat or canine class. KERRI wy Do We Need KNN 7 * Suppose there are two categories, ie, category A and category B and wi brand new statistics point x1, so this fact point will lie within of these classe solve this sort of problem, we need 2 K-NN set of rules. With the help of we will without difficulty discover the category or class of a selected da Consider the undemeath diagram - TECHNICAL PUBLICATIONS® - an up-tmust for inonbedge d Whe | NEAL lop | cat and re able : KN d dogs cit in Before KN Fig. 1.4 Why do ve Aner KN How Does KNN Work 2 pees On +The KANN working canbe elaine’ onthe basis ofthe below algorithm : ep -1 Select the wide variety K ofthe acquaintances, sop -2: Calculate the Euclidean distance o K rarity offends sep -3: Take the K nearest neighbors as according to the calculated Euclidean stance. step -4: Among these ok pals, count number the number of the data péints in each ss. sup 5: Assign the brand new records points to that category for which the quantity ofthe neighbor is maximum. sep -6 : Our model is ready. + Suppose we've got a brand new information point and we want to place it in the required category. Consider the under image % | a 7 Fig. 4.1.2 KNN example TECHNICAL, PUBLICATIONS - an uptintfer tonledge ——— Scanned with CamScanner WW ee tsi Lear eee caleulating the Euclidean qi Seated Lean: Gascon elect » By ; tan« h nearest neighbours in category 4 me Set the nearest acquaintances, as 3 nts. J the undemeath image ‘earest associates in class B, Consider died 2 e e ® % Category A : 3 neighbours: & wre CategeryB 2 neighbours Category 8 08\ ny 4 o%, New data Point ® e Category A Xs Fig. 4.1.4 KNN example continue + As we are able to see the three nearest acquai : -quaintan | subsequently this new fact point must bclong ee eas | | 1. What is the KNN algorithm ? Explain in det 2. Why do you need KNN ? How it works is explained with an example [A Support Vector Machine Algorithm + Support Vector Machine or SVM is one of the most popular supervised leaming algorithms that is used for classification in addition to regression troubles. However, in general, it’s far used for classification problems in Machine Leaming. + The intention of the SVM set of rules is to create the satisfactory line or selection boundary that could segregate n-dimensional space into lessons in order that we are able to easily position the new information factor in the perfect category in the future. This best decision boundary is known as a hyperplane. + SVM chooses the extreme points/vectors that help in developing the hyperplane. These severe cases are known as help vectors and as a result a set of rules is called a support vector machine. Consider the below diagram wherein there are exceptional categories which are categorised the usage of a selection boundary or hyperplane TECHNICAL PUBLICATIONS® - an up-hnst for knowledge Machine Leeming Maxienurn margin — erplane oo ‘Negative hyperplane ~ Po vectors x Fig. 4.2.1 SVM representation * Example : SVM can be understood with the example that we've used inside the KNN classifier. Suppose we see a peculiar cat that also has some functions of Puppies, so if we want a version that may appropriately pick out whether it is a ‘at or dog, then any such version can be created via using the SVM algorithm. We will first train our version with plenty of snap shots of cats and dogs so that it can find out about exceptional functions of cats and puppies, after which we check 4t with this odd creature. So as the assist vector creates a selection boundary between those two facts (cat and dog) and chooses intense cases (help vectors), it'll See the extreme case of cat and dog. On the idea of the assist vectors, it will classify it as a cat. Consider the below diagram : a Fig. 4.2.2 SVM example TECHNICAL PUBLICATIONS® - ‘8n up-thrust for knowledge Leeming set of rules may be used for dete + Stent categorization and so forth ‘tion, photograph classification, textual Supervised Leeming : Classification [GBI Hyper Plane and Support Vectors in SVM Algorithm |. Bias Tee oa es imuleple lines/choice boundaries to segregate the } classes \-dimensi rea, but we want to find out the ple: it boundary that facilitates to classify the statistics factors. This pee es js called the hyperplane of SVM. Be » The dimensions of the hyperplane depend on the functions gift within the dataset, which means that if there are 2 functions (as shown in photograph), then the hyperplane can bea instantaneously line. And if there are 3 functions, then the hyperplane might be a 2-size aircraft ‘ + We constantly create a hyperplane that has a most margin, this means that the most distance between the facts factors, [EEE support Vectors + The statistics points or vectors which are the closest to the hyperplane and which affect the position of the hyperplane are termed as support vectors. Since those vectors guide the hyperplane, therefore referred to as a support vector. [EB How Does SVM Work 7 HT Linear SVM + The running of the SVM set of rules can be understood through using an instance. Suppose we've a dataset that has two tags (inexperienced and blue) and the dataset has features x1 and x2. We want a classifier which could classify the pair(x1, x2) of co-ordinates in either inexperienced or blue, Consider the under photo : TECHNICAL PUBLICATIONS® ~ an up-thrust for knowledge Scanned with CamScanner Ma Supervised Loaming : Cla SalI Fig, 4.2.3 Linear SVM is ight line, we are able ty * So as it's a far 2-d area so by using just using a straight without difficulty separate those instructions. But there may be multiple lines thay | may separate those lessons. Consider the beneath picture : Fig. 4.2.4 Linear SVM. understanding * Hence, the SVM algorithm helps to discover the first-rate line or selection boundary; this exceptional boundary or region is known as a hyperplane. SVM algorithm unearths the nearest point of the traces from each of the lessons. These Points are referred to as guide vectors, The distance between the vectors and the hyperplane is called the margin, And the purpose of SVM is to maximise this TECHNICAL PUBLICATIONS® - an up-trus for knomoage | Sepertsed Looming : Classification argin. The hyperplane with many um mas pyperplane gin is known as the most suitable Fig. 4.2.5 Linear ‘SVM hyperplane (G20) Non Linear Svik Jinearly arranged, then we can ation y F separate it through using a directly line, however for non-linear information, we can not draw an unmarried directly line. Consider the beneath picture : Fig, 4.2.6 Non‘linear SVM TECHNICAL PUBLICATIONS® - an upthus for knowledge Supervised Leaming : Classi MactinoLeaming 4210 sett * So to separate these data points, we need to feature one greater size. For cen we've used dimensions x and y, 20 for non-linear informatio, ay will upload a 34 dimension z, It can be calculated as : z= x2+y2 + By adding the third measurement, the sample area will become a5 below, photograph : Fig. 4.2.7 Non linear SVM with third measurement * So now, SVM will divide the datasets into instructions within the following way. Consider the under photo : Best hyperplane Fig. 4.2.8 Datasets representation TECHNICAL PUBLICATIONS® - an up-tnust for inowledge Supervised Leaming -Ctassifcation what are linear SVM explain with cramp i ——___— 2 pan vo ee SVM wth cmp | eae ER via prot 29 SVD thos wed wit dts 7 Pt lh cme ink ¢ Sh age SUSI ERO tor regression. Ng voting classifier «. what do you mean by SVM ? Explain wy SSO s Sisco AT Ee Sem | oO Ensemble Learning : Baggin , i. ‘Adaboost Sing, Boosting, Random Forest, po Bagging +. Write short note on Adaboos Gradient tre bag | | | strong and correct model. + The important additives of bagging approach ate : The random sampling with replacement (bootstrapping) and the set of homogeneous system studying algorithms (ensemble studying). The bagging system is pretty clean to apprehend, first it's miles extracted "n" subsets from the education set, then those subsets are used to train "n base novices of the same kind. For making a prediction, each one of the "n” freshmen are fed with the test sample, the output of each learner is averaged (in case of regression) or voted (in case of classification). Figure suggests an overview of the bagging architecture. Refer Fig. 43:1 on next page. + tis crucial to note that the quantity of subsets as well as the quantity of items per subset can be decided through the nature of ML hassle, the same for the form of ML set of rules to be used. + For enforcing bagging, scikit-research presents a feature to do it without problems. For a primary execution we most effectively want to provide some parameters along with the bottom learner, the wide variety of estimators and the most number of samples consistent with the subset. TECHNICAL PUBLICATIONS® an, up-thrust for knowledge: Scanned with CamScanner by averaging (a “The tna deisel tater the eave of ‘rvoting (nthe cave Fig. 4.3.1 Bagging TEOIINIGAL PUBLICATIONS - an ypthnst for knowledge m ‘ jeamer, like a logistic repressor ae application consisted of taking a single artificial neural network, feeding 4% "e®, support vector machine or an corn assignment through here eMmation and coaching It to perform ‘Then ensemble techniques had by peautify the overall Petfonnance of any Generally, ensemble strategies are_bui ” individual dedsion tees, as ve cn seen” Sd 9f oNpINg varinls of [EA Boosting ‘« Boosting fashions fall inside this circle of relatives of ensemble strategies. « Boosting, initially named hypothesis boosting ie or weighting the fol thsi wed to ect out poop af oak econ so tk every new leamer offers extra weight or is handiest educated with observations which have been poorly categorised via the previous newbies, + By doing this our team of fashions learns to make correct predictions on all types of data, now not just on the maximum commonplace or clean observations. Also, if one of the character fashions may be very terrible at making predictions on a few types of remark, it does not rely, as the alternative N-1 models will most possibly make up for it. : « Boosting should now not be confused with bagging, that is the alternative main own family of ensemble methods : At the same time as in bagging the susceptible novices are educated in parallel the use of randomness, in boosting the rookies are educated sequentially, which will be capable of carry out the task of statistics ‘weighting /filtering defined in the preceding paragraph. |) = EZ) = - ag LE = Learn) VE SO ae Sie Bagging ac | Fig. 4.3.2 Bagging and boosting TECHNICAL PUBLICATIONS® = an upthrust for knowledge Scanned with CamScanner ———x—_— Machine Leaming fas boosting the see from the previous photograph, in boosting th cial sea * Seatuve importance or weights epreenied inside the feos aaa ine ene time as in bagging all beginets have the same Weigh inside the very last decision. 7 ete weighted (represented with the aid of it facts set is " ; et ae ae aoe ‘records factors), so that observations that have », ass i re significance within incorrectly labelled by way of classifier n are given mo thin a schooling of model n + 1, even as in bagging the education samples are tq,” randomly from the complete populace. [EER] Random Forest Random forest is a famous system learning set of rules that belongs tos, supervised getting to know method. It may be used for both classification ang regression issues in ML. It is based (olally on the concept of ensemble studying | that's a process of combining multiple classifiers to solve a complex problem ang to enhance the overall performance of the model. + As the call indicates, "Random forest is a classifier that incorporates some of choice timber on diverse subsets of the given dataset and takes the average to « improve the predictive accuracy of that dataset.” Instead of relying on one decision tree, the random forest takes the prediction from each tree and primarily based on most of the people's votes of predictions, and it predicts the very last output. * The more wider variety of trees within the forest results in better accuracy and Prevents the hassle of overfitting. How Does Random Forest Algorithm Work ? + Random forest works in two-section first is to create the random woodland by combining N selection trees and second is to make predictions for each tree created inside the first segment. * The working technique may be explained within the below steps and diagram : Step - 1: Select random K statistics points from the schooling set. Supervised Learning - Cass Step~2: Build the selection trees associated with the selected information Points (Subsets). Step - 3: Choose the wide variety N for selection trees which we want to build Step - 4: Repeat step 1 and 2 Step - §: For new factors, locate the predictions of each choice tree and assign the new ecords factors to the category that wins most people's votes, TECHNICAL PUBLICATIONS® - en upsmust for knowledge ample: S . cc i aed ataset that includes more than one fruit J divided into subsets and given 4g na’ Wooded area classifier. The dataset Fig. 4.3.3 Example of random forest [EFA Applications of Random Forest ‘There are specifically 4 sectors where random forest normally used : 1. Banking : Banking zone in general uses this algorithm for the identification of Joan danger. 2, Medicine : With the assistance of this set of rules, disorder traits and risks of the disorder may be recognized. 3. Land use : We can perceive the areas of comparable land use with the aid of this algorithm. 4, Marketing : Marketing tendencies can be recognized by the usage of this algorithm. TECHNICAL PUBLICATIONS® - an upsthrst for knowledge Scanned with CamScanner 4-16 supervised Leaming: Cressi, A SCS (EEEET Advantages of Random Forest : “a Random forest is able to appearing both classification and Seana TESPonsibilitig * It is capable of managing large datasets with high dimensional i © Tt enhances the accuracy of the version and forestalls the overfitting trouble. Disadvantages of Random Forest ' + Although random forest can be used for both class and regression responsibilte, it isn't extra appropriate for regression obligations. Adaboost * AdaBoost also referred to as adaptive boosting is a method in Machine Learning used as an ensemble method. The maximum not unusual algorithm used with AdaBoost is selection trees with one stage meaning with decision trees with most effective 1 split. These trees also are referred to as decision stumps. * The working of the AdaBoost version follows the beneath-referred to path : = Creation of the base learner. = Calculation of the total error via the beneath formulation. = Calculation of performance of the decision stumps. + Updating the weights in line with the misclassified factors. Fig, 4.3.4 Adaboost Creation of a new database : AdaBoost ensemble : + In the ensemble approach, we upload the susceptible fashions sequentially and then teach them the use of weighted schooling records. * We hold to iterate the process till we gain the advent of a preset range of vulnerable learners or we can not look at further improvement at the dataset. At the end of the algorithm, we are left with some vulnerable learners with a stage fee. Garten 1. Write short note on TECHNICAL PUBLICATIONS® - an up-hust fr knowledge ie ord 7 7 wt pinary-vs-Multiclass Class op and Imbalanced Multiclas. ‘Supervised Learning : Classification fication, Balanced S Classification go What is Binary Classification , this a procedure or challenge of tye, in whic into two training. Ifs basically a Ling of Fan as is oe categorised tne issue belongs to Out Which of two agencies sume, two emails Let us assume, are dispatched to a a “i yi You, one is sent via an insurance enterprise that continues sending their commercials and the opposite is sent out of your financial ae aee Your credit score card invoice. The electronic snail service provider will classify the two emails the primary one could be sent to the spam folder and the second one could be stored within the primary. one: + This process is called binary class, as there are fvo discret is and the opposite is number one. So, this is a Pepin a = ny one is spam 744.2 What is Multiclass Classification 2 + Classification is categorising facts and forming businesses primarily. based on the similarities. In a dataset, the independent veriables ur features play @ crucial role in classifying our statistics. When we speak approximately multiclass classification, we've got extra than classes in our structured or target variable. 2 e 4 a x ° x ce 4A 5c x o70 2S ° o8o My % Binary classification Multi-class classification Fig, 4.4.4 Binary-vs-Multiclass classification Parameters ‘Multicclass classification t ‘ aes sn eae ‘ : Ae tae es Ik is a type of two groups, ie. ere may be any variely of pee” ‘Hashes bjecs nat moximim two. insucions in i clases the tem z classes. nto more than lessons. “Algorithms used yk neighbours _ Decision trees © _k-Nearest neighbours TECHNICAL PUBLICATIONS® - ‘an up-thrust for knowisdge Scanned with CamScanner Balanced and Imbalanced Classification ; | Balanced classification : . When the usage of a device gains knowledge of a set of rules, it’s miles crucial to teach the model on a dataset with nearly the same number of sami This is referred to as a balanced elegance. We want to have balanced train train a version, however if the training isn't balanced, we need to use g balanéing method before using a device gaining knowledge of the set of rules, Balanced classes 500- Fig, 4.4.2 Balanced classification TECHNICAL PUBLICATIONS® - an uptinst fr knowledge ino Learning, pee 449 ag oe imbalanced Classification etd cig Ctasteatn + In imbalanced type trouble is ay distribution of examples across the ae of a tribution can vary f ognized CoE stare WY OM & moderate bia 4 ‘lasses is biassed or skewed. The ig one instance in the minority clas, 1 aN excessive imbalance where there of examples in the majority senifcen ee heaps or thousands and thousands instructions, ees 4. What fs the difference betocen binary and multiclass clssiicaion ? 2, What is the difference balanced and imbalanced multicassclasfcation ? [GA Variants of Multiclass Classification : One-vs-One and One-vs-All * Although many class issues can be described using two classes (they're inherently multi-class classifiers), some are described with extra than two classes which calls for adaptations of devices gaining knowledge of algorithms, = Logistic regression can he obviously prolonged to multi-class mastering troubles by using changing the sigmoid feature with the softmax characteristic. The KNN set of rules is also truthful to increase to multiclass cases. When we discover kk closest examples using a distance metric which includes euclidean distance, for the enter xx and take a look at them, we go back to the elegance that we noticed the maximum a few of the kk examples. Multi-magnificence labelling is likewise trivial with Naive Bayes classifier. TECHNICAL PUBLICATIONS® - an upthnst fr knowledge | See => TE Scanned with CamScanner Loni pm Soreatieenty: cast, Machine Leeming SVM cannot be obviously prolonged to mult-elegance eae ao may ve cued out extra effectively in te binary case, What sh ee me nae je a multi-magnificence problem however 4 binary type 1B Set of rules ? « One commonplace approach’is called One-vs-All (typically called One-vs-Rest o, OVA type). The concept is to convert a multi-legance ae jnio|C’binary typ, hassle and construct C specific binary clacefiers. Here pick one magnificence ang teach a binary classifier with the samples of the selected class on one aspect ang different samples on the alternative aspect. Thus, it end up with C classifiers While testing, in reality classify the sample as belonging to the class with mos, rating among C classifiers. For example, if we've 9 training y< 1.2 three 61,2, three, we create copies of the original dataset and regulate them. In the primary reproduction, we update all labels now not equal to one by way of zero. In the second replica, we replace all labels now not identical to two with the aid of 0. In - the 1/3 reproduction, we update all labels not same to three by means of 0. Now we've got 3 binary classification problems wherein we must study to differentiate between labels 1 and zero; 2 and zero; and three and 0. Once we have the three fashions, to classify the brand new input function vector, we follow the 3 models to the center and we get three predictions. We then peicent the prediction of a non-0 class which is the most certain. Another approach is One-vs-One (OVO, additionally called All-as opposed to-All or AVA). Here, pick out 2 training at a time and educate a binary classifier using samples from the selected -training only (other samples, are disregarded on this step) : repeat this for all of the -elegance combos. Hence emerge as wi C(C-1)2C(C-1)2 range of classifiers. At prediction time, a vote casting scheme is camied out : All C(C-1)/2C(C-1)/2 classifiers are carried out to an unseen patter, and the elegance that were given the very best quantity of "+1" predictions gets expected by way of the combined classifier. All-as opposed-to-all has 2 tendency to be superior to at least one-versus-al. A problem with the previous schemes is that binary classifiers aro sensitive to errors. If any classifier makes any mistakes, it may have an effect on the vote remember. Tn One-vs-One scheme, every character mastering problem best includes a small subset of records whereas with Une-vs-All, the entire dataset is used for a range of training instances. TECHNICAL PUBLICATIONS” - an up-tust for knowledge a can consider One-v5-Rest (yp) classification algorithms at 3 gris approach especially splits the mui raises in order that the binary clgg i One-vs-Al(Oy, a A) a8 a technique to making i he as multiclass class algorithms. r e ie mation as binary classification binary type statistics. can be carried out to transform method of conversion of the data i maj Mi facts in which we have 3 lesson as Pematned ea » setosa F , Versicolor « Virginica ‘the transformed statistics as binary class informati Peis: formation will appear like the Setosa vs [Versicolor, Virginica] » Versicolor vs [Setosa, Virginica] 1 Virginica vs [Setosa, Versicoor] By searching on the conversion we are able to think that there’ irems fashions however with the large datasets growing 3 eran fc ef non-correct method to modelling, Here One-vs-Rest (OvR) or One-vs-All(OvA) involves store us in which binary classifiers may be skilled be are expecting any elegance as high-quality and different lessons as poor. We in particular use binary classifiers that may give membership chance or possiblity-like rankings due to the fact argmax of those scores can be applied to expect a class out ‘of a couple of classes. Let's see how we are able to put in force binary classifiers the use of the One-vs-Rest (OvR) or One-vs-All(OvA) approach for a multiclass category problem. 1 What ere variants of multiclas classification : One-s-One and One-re-All ? 2, Explain with example One-vs-One and One-vs-All Evaluation Metrics and Score tric for evaluating category models. Informally, accuracy is the # Accuracy is one met property. Formally, accuracy has the fraction of predictions our version were given following definition @ TECHNICAL PUBLICATIONS® - an uprivuat fr knowledge Ei # Correct Predictions i Accurey = ee -- equation ine the class has three goals * This jon includes all labels(targets). Imagine thr faa “AY, "BY and °C" skewed with 200, 30 and 20 data. If the predictions supply « hundred eighty, 20 and 10. Fventually, the accuracy may be eighty four%. Buty, ay see the accuracy does now not deliver an image of the way awful "Br °C" predictions are because of those have individual accuracy with sixty sing ay 50 %. You might suppose the gadget mastering version has eighty four% accura and itis applicable to the predictions but it isn't always. ! * Precision attempts to reply to the subsequent query : What percentage of supe, identifications changed into sincerely correct ? + Precision is described as follows : ere ~ TP+EP j Beale at TP- True Positive FP- False Positive FN-False Negative TN-True Negative Predicted Actual Fig. 4.6.1 Precision n recall * Precision returns positive prediction accuracy for the label and recall retums the (rue pusitive rate of the label. Precision price and recall value. If everyone asks “T need this Precision cost” you need to ask lower back “At what Recall cost". This controversy is another element that have to be discussed later. TECHNICAL PUBLICATIONS® - en ups for knowledge measure = On a Recall + Precision ween oe) i a | Here, the fee of Fmeasure(PL-score) o the model. j | 2+Recall Precision | | rect accuracy and bear in mind | , Now permit's see what recall and precision Precision simply approach, «Recall : Tt tells us what proportion of data belonging to lass A is assessed efficiently as in ma 7 a rea ee en say erificence aid of our classifier. « Precision : It tells us what percentage of : ge of facts that sifier has labelled i gure class, say magnificence A in reality belongs oi 3 ar ae icence <= 2. What is the FL score ? [Hd Micro-average Method In micro-average technique, you sum up the individual proper positives, fake positives and false negatives of the gadget for exceptional units and then apply them to get the statistics. For example, for a set of information, the system's True advantageous (IPI) = 12 False nice (FP1) = 9 | False terrible (FN1) = three Then precision (Pl) and recollect (RI) could be 57.14 and 80 and for ‘one-of-a-kind set of information, the gadgets | True effective (TP2) = 50 False fine (FP2) = 23 False bad (FN2) = 9 + Then precision (P2) and do not forget five + Now, Micro-common approach is (R2) will be sixty eight49 and 84Seventy the common precision and consider of the machine the use of the ; TECHNICAL PUBLICATIONS” - an ‘up-thnust for knowledge Scanned with CamScanner 4-24 -FP2) ° ision = 7P2)/(TP1+TP2+FP 1+) a Bes /(12+5049+23) = 65-Ninety six ‘Micro-average of don't forget = (Fp1sTP2)/(IPL TP2+ENI+FN2) y12+50+Three+9) = Eighty three Seventy eigh, = (250) iy the harmonic imply OF thase ig The micro-common F-score might be simp! The you sum up the man oF WEMEN Proper Posy : poe : of the system for exceptional sets ang Se Poa afc For eumpe fo ast of Satis, then device te ‘True tremendous (IPI) = 12 False effective (FP!) = nine False terrible (FN1) = three rf . . Then precision (PI) and bear in mind (Rl) could be Biffy seven 14 and eighty ang for a special set of information, the gadgets = ‘True tremendous (IP2) = 50 - False pasitive (FP2) = 23 False bad (FN2) = nine Then precision (P2) and recollect (R2) can be 68.49 and eighty four. ‘Now, the common precision and recall of the system the use of Machine Leaming Micoareage of precision = (TP1+1P2)/(IP1+IP2+FP1+FP2) ‘= (12+50)/(12+50+9+23) = 65 Ninety six Micro-common of consider = (TP1+TP2)/(IP1+TP2+FN1+FN2) = (12+50)/(12+50+Three+Nine) = 83.Sev * ‘The micco-average Fscore can be virtually the harmonic imply ‘Sgures. mw Macro-average * Macro-average approach can be used whilst you need to know performs usually across the units of facts. You have to no Ion, articular choice with this average * On the other hud, micro-common may be a useful measure whilst your dataset Varies in length. Soil * Tis method is simple Just take the comman of the precision and dot forget the System on specific sets. For example, the macro-average precision and forget of the gadget for the given example is "TECICAL PURUEATON nwt roc yoine Learning 4:25 ‘Superviad Leaming : Classification Macro-average precision = (PI+P2)/2 = (57.14+68Forty nine)/2 = Sixty two. Eighty two Macro-average don't forget = (RI+R2)/2 = (80+84.75)/2 = Eighty two.25 ESI Suitability + Macro-average approach can be used when you want to recognise how the gadget plays ordinary across the sets of information. You must no longer provide you with any unique decision with this common. + On the opposite hand, micro-common can be a useful measure while your dataset varies in size. LEI Cross Validation « In gadget studying, we couldn't healthy the model in the training data and may not say that the model will paintings appropriately for the actual records. For this, we must guarantee that our model were given the perfect patterns from the facts and it isn’t always getting up too much noise. For this reason, we use the cross-validation technique. Cross-validation is a method in which we educate our model on the usage of the subset of the facts-set after which compare the use of the complementary subset of the facts-set. © The 3 steps worried in move-validation are as follows = 1. Reserve a few parts of the sample statistics-set. 2. Using the relaxation statistics-set educate the version. 3. Test the version of the usage of the reserve part of the data-set. [EXE Methods of Cross Validation + In this technique, we carry out training on the 50 % of the given facts-set and the rest 50 % is used for the testing purpose. The principal drawback of this method is that we perform training on the 50 % of the dataset, it could be possible that the ultimate 50 % of the statistics contains some essential information which we are leaving whilst schooling our model ie. higher bias. Following are types of validation FETEN LOOCV (Leave One Out Cross Validation) * In this approach, we carry out training at the whole data-set but leaves most effective one data-factor of the to be had data-set and then iterates for each statistics-point. It has a few blessings as well as risks also. TECHNICAL PUBLICATIONS® - an up-thnst fo knowiedge Scanned with CamScanner Machine Leaming 4-26 Supervised Leaming : cig, oA gain of the usage of this method is that we make use of all facts points ang, this reason it's far low bias. The foremost disadvantage of this method is that it leads to better variation iy the trying out version as we are checking out in opposition to one data factop : the data factor is an outlier it is able to lead to higher variation. ANothe } downside is it takes a number of execution time as it iterates over ‘the numbe, al statistics factors’ instances, EERE] Fold cross Valiation + In this method, we cut up the facts-set into k range of subsets(called folds) «,.. | we perform education at the all the subsets but depart one(ok-l) subset for jy, | evaluation of the skilled version. In this technique, we iterate ok times with . distinctive subset reserved for testing cause each time. Note : * It is always suggested that the price of k have to be 10 as the lower price of k takes closer to validation and better value of okay leads to LOO. KEE] Applications of Cross-Validation * This technique may be used to evaluate the performance of different predict modelling strategies. * Ithas brilliant scope inside the scientific research discipline. + It also can be used for the metaevaluation, as it is already being utilised by the tecords scientists in the area of scientific statistics. Qo TECHNICAL PUBIICATIONS® - an uptirust or knowage Unsupervised Learning Syllabus K-Means, K-medoids, Hierarchical, and Density-based Clustering, Spectral Clustering. Outlier analysis: introduction of isolation factor, lacal outlier factor. Evaluation metrics and score: elbow method, extrinsic and intrinsic methods Contents 5.1 Introduction of Clustering 5.2 K-means eee ee eee eee es Maye19, Jume-22, +-++++* Marks 53 Hierarchical Clustering ...... eee dune-22, = Marks 8 5.4 Density-based Clustering 5.5 Outlier analysis 5.6 Evaluation Metrics and Score .. . .. May-19, June-22, sos Marks 4 Scanned with CamScanner 5-2 Unsuperviseg ‘Machine Leaming Introduction of Clustering What is cluster analysis ? Given a sct of objects, place ae similar (or related) to one ano} in other groups. ; | zi werful data-mining tool for any organisati, * Cluster analy can be 2 power ee sales transactions or other = ha needs to identify discrete SOUP insurance providers use cluster ana}, eS of behaviors and things. For example, ins 2 ysis ty detect fraudulent claims, and banks use it for credit scoring. i f similar : thematical models to discover groups © coe Saree among customers within each group, sy se the same class. In other yw; is of that belong #0 the same | Oras the ae ice ee cone cluster and dissimilar are grouped in oy. aes of di of «Clustering is a proces of pavttonng a set of data in a set of meaning subclasses. rs in the subclass shares 2 common trait. Tt helps a ice, | understand the natural grouping or structure in 2 data set. i i thods, hierarchi * Various of clustering methods are partitioning me i sucess lustering, Density-based clustering and Model -baseg clustering: Cluster analysis is process of grouping a set of data - objects into clusters. Desirable properties of a clustering algorithm are as follows : 1. Scalability (in terms of both time and space) 2. Ability to deal with different data types 3. Minimal requirements for domain knowledge to determine input parameters, 4. Interpretability and usability. © Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data. Clustering can be considered the most important unsupervised leaming problem. A cluster is therefore a collection of objects which are “similar” between them and are “dissimilar” to the objects belonging to other clusters. Fig, 5.11 shows cluster. in groups such that the objects in a nt te trent from (or unrelated to) the a ae objec, & Fig, 5.1.1 Cluster TECHNICAL PUBLCATIONS® in uptht rknowodgo yoonine Learning Ds Unsupervised Learning + In ths i we easily identify the 4 clusters into which the data can be divided; ee ay criterion is distance : Two or more objects belong to the same luster if they are "close" according to a given distance (in thi ometrical distance This is called distance-based clustering. Sane n ; ¢ Clustering means grouping of data or dividing a large data set into smaller data sets of some similarity. «A clustering algorithm attempts to find natural groups of components or data based on some similarity. Also, the clustering algorithm finds the centroid of a group of data sets. ‘© To determine cluster membership, most algorithms evaluate the distance between a point and the cluster centroids. The output from a clustering algorithm is basically a statistical description of the cluster centroids Wwith the number of components in each cluster. ‘ «Cluster centroid : The centroid of a cluster is a point whose parameter values are the mean of the parameter values of all the points in the clusters. Each cluster has a well defined centroid, + Distance : The distance between two points is taken as a common metric to as see the similarity among the components of a population. The commonly used distance measure is the euclidean metric which defines the distance between two points p = (py, p2/--) and q = (41, 42.) is given by: > av? ind + The goal of clustering 1s to determine the intrinsic grouping in a oct of unlabeled data. But how to decide what constitutes a good clustering ? It can be shown that there is no absolute best" criterion which would be independent of the final aim of the clustering. Consequently, it is the user which must supply this criterion, in such a way that the result of the clustering will suit their needs. Clustering analysis helps construct meaningful partitioning of a large set of objects. Custer analysis has becn widely used in numerous applications, including pattern recognition, data analysis, image processing, etc d TECHNICAL puBLicaTions® = an up-thrust for knowledge Scanned with CamScanner a Unsupenisa ae Lea eT * Clustering algorithms may be classified as listed below 1. Exclusive clustering 2. Overlapping clustering 3. Hierarchical clustering 4, Probabilistic clustering ' wee ‘A good clustering method will produce high quality clusters with high intra.g similarity and low inter-class similarity. The quality of a clustering result dj cn both the similarity measure used by the method and its implementation, quality of a clustering method is also measured by its ability to discover some 4, all of the hidden pattems. Examples of Clustering Applications is their customer p; 1. Marketing : Help marketers ‘discover distinct groups in — and then use this knowledge to develop targeted marketing programs. 2, Land use + Identification of areas of similar land use in an earth observation database. - 3. Insurance : Identifying groups of motor insurance policy holders with a higy | average claim cost. 4, Urban planning : Identifying groups of houses according to their house type | value, and geographical location. | 5. Seismology : Observed earth quake epicenters should be clustered along continent faults. ESRI Typical Requirements of Clustering in Data Mining 1. Scalability : Many clustering algorithms work well on small data sets. 2. Ability to deal with different types of attributes : Many algorithms are designed io chister interval-based data. 3. Discovery of clusters with arbitrary shape : determine clusters based on Euclidean or Manhattan distance measures | 4 Minimal requirements for domain knowledge to determine input Parameters: Many clustering algorithms require users to input certain parameters-in cluster analysis, 5. Ability to deal with noisy data : Most real- world databases contain outliers or missing, unknown or erroneous data. Some clustering algorithms are sensitive (o such data and may lead to clusters of poor quality. 6 Incremental clustering and incensitivity to the order of input recurds : Some Slustering algorithms cannot incorporate newly inserted data into existing clustering structures. TECHNICAL PUBLICATIONS® - an upxtrus for krowedge | Many clustering -algorithms Leaming ee 6-6 Unsupervised Leaming 7, High dimensionality : A database or a data warehouse can is dimensions or attributes, 7 ie eee 8, Constraint-based clustering ; Real-world i ; + Reals applicat clustering under various kinds of conan we ge 9, Interpretability and usability : Users : ect lust sults interpretable, comprehensible and usable, rel a [5.1.2] Problems with Clustering 1. Current clustering techniques do not address all the requirements adequately 2. Dealing with large number of dimensions and large number of data items can be problematic because of time complexity ; 3, The effectiveness of the method depends on the definition of "distance" ; 4. If an obvious distance measure doesn't exist we must "define" it, which is not always easy, especially in multi-dimensional spaces ; 5. The result of the clustering algorithm can be interpreted in different ways. ERE} Types of Clusters «Type of clusters are as follows : 2) Well - separated clusters b) Prototype - based clusters ©) Contiguity - based clusters 4) Density - based clusters a) Well - separated clusters : «A cluster is a set of points such that any point in a cluster is closer to every other point in the cluster than to any point not in the cluster. + Fig, 5.1.2 shows well-separated cluster. Fig. 5.1.2 Well-separated cluster TECHNICAL PuBLICATIONS® ‘an up-thrust for knowledge Scanned with CamScanner on ‘Machine Leaming city that all the objects in a ae , 10 570 + Sometines a theo sw wd 6 hen ‘tion of a cluster is satisfied on), ther. Definition 0! ly sufficiently close to one sn! daia contains natural clustes 1) Prototype - based ne such that an obpetin 8 UT 1S OST (oy, | * A cluster is a set Of | a Custer, than to the center of, any otng, iy to the prototype or “cent ee referred to as "Center-Based" Clusteys, iy | tors can als Prototype based ss f is often a centroid, the average of all the Points jy | : Ree es he mast represetauve” point of © cloter, Fig, 515 ster, ot a 4 center-based cluster. Fig, 5.1.3 4 Center-based clusters & Shon, | + Tf the data is numerical, the prototype of the cluster is often a centroid i i ts | average of all the points in the cluster. + Ifthe data has categorical attributes, the prototype of the cluster is often a mada, | i.e, the most representative point of the cluster. | © Objects in the cluster are closer to the proiotype of the cluster than to yy prototype of any other cluster. + KeMeans and K-Medolds are the exaunples of prototype based clustering algoritiy ©) Contiguity - based clusters +A cluster isa set of poins such that a point in a cluster is closer (or more simi | to one or more other point in the cluster than to any point not in the cluster, Fig. 5.1.4 8 contiguous clusters TECHNICA, PuoLIeATIONS + an upirust for knowiedge oaring eT tp orig pensity - based + A cluster is a dense region of points, which from other regions of high density, ' separated by low-density regions, « Used when the clusters are irregular or intertwined, and when noise and outliers are present, pian He RE] Desired Features of Cluster Analysis © Features are as follows : ft Scalability : Data-mining problems can be large and therefore a cluster-analysis method should be able to deal with large problems gracefully. Ideally, performance should be linear with data-size 2. Only one scan of the dataset : For large problems, data must be stored on disk, so cost of 1/O disk can become significant in solving the problem. 3. Ability to stop and resume : For large dataset, cluster-analysis may require hhuge processor-time to complete the task. In such cases, the task should be able to be stopped and then resumed when convenient 4, Minimal input parameters : The method should not expect too much guidance from the data-mining analyst. 5, Robustness : Most data obtained from a variety of sources has errors. ‘Therefore, the method should be able to deal with noise, outlier and missing values gracefully. 6. Ability to discover different clustershapes : Clusters appear in different shapes and not all clusters are spherical. So the method should be able to discover cluster-shapes other than spherical. 7, Different data types : Many problems have a mixture of data types, for eg. numerical, categorical and even textual, Therefore, the method should be able to deal with numerical, boolean and categorical data 8, Result independent of data input order ; The method should not be sensitive to data input-order. TECHNICA! PUBLICATIONS® - an upthnst for knowledge Scanned with CamScanner cluster is represented by the Stands for number of clusters, it is typically a user need riteria can be used to automatically estimate K. center of the cluster, "K" to the algorithm ; some « uunGl all the components are grouped into the final required number of clusters, * Given K, the K-meais algorithm consists of four steps : 1. Select initial centroids at random. 2. Assign each object to the cluster with the nearest centroid. 3. Compute each centroid as the mean of the objects assigned to it. 4. Repeat previous 2 steps until no change * The x1, .., xy are data points or vectors of observations. Each observation (vector x;) will be assigned to one and only one cluster. The C(j) denotes cluster number for the observation. K-means minimizes within-cluster point scatter : « WO =5 9 DD ii-ail? K~10()=KC{)-K x = Nk DY isi-mel?? Ka coeK where, mx is the mean vector of the K™ cluster. Nj is the number of observations in K" cluster. means algorithm properties 1. There are always K clusters. 2, There is always at least one item in each cluster. 3, The clusters are non-hierarchical and they do not overlap. TECHNICAL PUBLICATIONS® - an up-hnist for knowiedge Scanned with CamScanner Unsupervised Mechine Leaming Wy to its cluster than any other cluster beg Nise 4, Every member of a cluster 5 doser TT escent does nakakeaye inte Be =I of lust The K-means algorithm process Ps ss partitioned into K dusters and the Cala Polnts are rang 1, The dataset is parti 3 that have roughly the same iy assigned to the clusters resulting in clusters ibe of data points. 2 Fk ach date int, ‘or each data point Bee ena a. Calculate the distance from the data own cluster, leave it where it is. . If the data point is closest to its | c. If the data point is not closest to its own cluster, anove it into the cjg, cluster. | 43. Repeat the above step until a complete pass through all the data points resyi no data point moving from one cluster to another. At this point ‘the clusters are ee stable and the clustering process ends. k cal 4. The choice of initial partition can greatly affect the final clusters terms of inter-cluster and intracluster distances and cohesion. © Kemeans algorithm is iterative in nature. It converges, how minimum is obtained. It works only for numerical data. Thi implement Advantages of K-means algorithm : 1. Efficient in computation 2. Easy to implement Weaknesses : 1. Applicable only when mean is defined. 2. Need to specify K, the number of clusters, in advance. 3. Trouble with noisy data and outliers. 4 Not suitable to discover clusters with non-convex shapes. ERAT K-medoids Leaming potest + The most common realisation of K-medoid ch Medoids (PAM) algorithm. PAM uses a greedy search optimum solution, but it is faster than exhaustive search. eae + Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster. 10 lustering is the Partitioning Around 0123456786810 012345678910 Fig. 5.2.2 + A medoid can be defined as that object of a cluster, whose average dissimilarity to all the objects in the cluster is minimal ie. it is a most centrally located point in the given data set. 1. The algorithm begins with arbitrary selection of the K objects as medoid points out of m data points (n > K) ; 2. After selection of the k medoid points, associate each data object in the given data set to most similar medoid. The similarity here is defined using distance measure that can be euclidean distance, manhattan distance or minkowski distance 3, Randomly select nonmedoid object 0’ 4, Compute total cost, S of swapping initial medoid object to O' 5, If <0, then swap initial medoid with the new one (if $ < 0 then there will be new set of medoids) 6, Repeat steps 2 to 5 until there is no change in the medoid. [EEF] Difference between K-mean and K-medolds Clustering ES a 7 K-means y K-medoids fs ‘luster {s represented by the center of the Each cluster is represented by one the” luster objects in the cluster. i: | Simple Centroid-based method. TECHNICAL, PUBLICATIONS® - an up-thrust for knowledge Scanned with CamScanner in tn i Scanned with CamScanner ae should be smaller © Each individual's distance to its own cluster mean maler than distance fo te olber case's mem whch isnot ths cse wil iid 9° Thus, individual 3 is relocated to cluster 2 resulting in the new partition EE] Hierarchical Clustering ‘© Hierarchical clustering arranges items in a hierarchy with a based on the distance or similarity between them. The gr the resulting hierarchy is a tree-structured graph called a d * The tree is not a single set of clusters, but rather a clusters at one level are joined as clusters at the next level. © The hierarchical clustering algorithm is an unsupervised technique. © Hierarchical clustering starts with k = N clusters and proceed | Closest objects into one cluster, obtaining k = N-1 clusters. The two clusters to obtain k-1 clusters is repeated until we reach the of clusters K. * Fig, 53:1 shows type of hierarchical clustering, Divisive and two types of hierarchical clustering. Fig. 5.3.1 Types of hierarchical clustering | | fl Advantages and Disadvantages of Hierarchical Clustering 1, Advantages | « It 4s simple to implement. * It is easy and results in a hierarchy, a structure that contains more information. «It does not need us to pre-specify the number of clusters. 2. Disadvantages It breaks the large clusters. It is Difficult to handle different sized clusters and convex shapes. It is sensitive to noise and outliers. + The algorithm can never be changed or deleted once it was done previously. [EEZ Divisive Methods + Divisive clustering is known as the top-down approach. Divisive methods initialize with all examples as members of a single cluster, and split this cluster recursively. * Start at the top with all documents in one cluster. The cluster is split using a flat clustering algorithm. This procedure is applied recursively until each document is in its own singleton cluster. ‘It subdivides the clusters into smaller and smaller pieces, until each object form a cluster on its own or until it satisfies certain termination conditions, such as a desired number of cluster or the diameter of each cluster is within a certain threshold. + The divisive hierarchical clustering, also known as DIANA (Divisive ANAlysis) is the inverse of agglomerative clustering. * Di ve clustering is good at identifying large clusters. TECHNICAL PUBLICATIONS® - an up-hnist for knowledge Scanned with CamScanner Hate oars sd et EER] agglomerative Clustering This also known as AGNES (Agglomerative Ne manner. a . single-lement cluster (leap, 4, © That is, each object is initially considered 28 = ent _ ose te tv csters that are the most simlar are combing igger cluster (nodes). Eats bigger is iterated until all poin's are member of just one single big = (root). Teens is a tree which can be plotted as a dendrogram i : belong to this category hierarchical clustering methods a «Initially, AGNES places each objects into a cluster of its own. The clusters are g, : terion. merged step-by-step according to some ci / : For example, cluster C, and C2 may be merged if an object in Cy and objec Co form the minimum Euclidean distance between any to Objects from ites, ing) Te works in « bot, ‘clusters. In the agglomerative hierarchical approach, we start by defining each data point be a cluster and combine existing clusters at each step. Here are four methods for doing this 1. Single linkage-: Smallest pairwise distance between elements from ea cluster. It tends to produce long, "loose" clusters. 2 Complete linkage : Largest distance between elements from each cluster. j tends to produce more compact clusters 3. Average linkage : The average distance between elements from each cluster, jy can vary in the compactness of the clusters it creates. 4. Centroid linkage : Distance between cluster means. Agglomerative clustering is good at identifying small clusters. * Steps for an agglomerative hierarchical cluster analysis. 1. Find the similarity or dissimilarity between every pair of objects in the data set. In this step, we calculate the distance between objects using the pdist function. The pdist function supports many different ways to compute this measurement. 2. Group the objects into a binary, hierarchical cluster tree. In this step, we link pairs of objects that are in close proximity using the linkage function. The linkage function uses the distance information generated in step 1 to determine the proximity of objects to each other. As objects are paired into binary clusters, the newly formed clusters are grouped into larger clusters until a hierarchical tree is formed. AL PUBLICATIONS®- an upsttust for knowledge * yeemleemig Sa nporint ening 3. Determine where to cut the hierarchical tree into clusters. In this step, we the cluster function to prune branches off the bottom of the heal = D and assign all the objects below each cut to a single cluster. This creates . partition of the data. The cluster function can create these clusters by detecting 4 natural groupings in the hierarchical tree or by cutting off the hierarchical tree io at an arbitrary point. aI pa Difference between Agglomerative and Divisive Clustering + Agglomerative dustering is good identifying small clusters. 5 Each cluster starts with only one object. ee fF & Ifis also known as AGNES t (Sgglomerative Nesting). [5 -Tteratively clusters are merged together. (GREREED cesiter Giz diensionad den et (7, 10, 20, 28; 361, perform laerrecal clustering ard plot the dendrogram to visualize it. See Large clusters are successively di | Solution : Draw the graph using data set. | 1.007 | 0757 0504 0254 | 0.004 | 0254 0504 0.75 1.004 Fig. 5.3.2 TECHNICAL PuBLICATIONS® = an upthrust for knowledge —— Scanned with CamScanner Fr | gone 00g . = 19 yn Unsuporicod Looming _joomlets linkage : Complete linkage mae DiGi iss fs tae 1 2) (7,10) 20 2845 SNe, (3 Sane 3) (7,10) 20 (28,5) SAY, 3 15 4) (710,20) (28,35) \y 2 | | 28: | 13 7 3 710 20 28 35 (Dendrogra) Fig, 5.3.4 + Using complete linkage two clusters are formed Cluster 1 : (7, 10, 20), Cluster 2 : (28, 35) | Ee 1, What is agglomerative clustering ? Explain with example E24 Density-based Clustering + Density-based clustering is unsupervised learning methodologies used in model building and machine learning algorithms. This is a clustering method that is used in machine learning to separate clusters of high density from clusters of low density. TECHNICAL PUBLICATIONS® - en up-trust fr knowiedgo ed Scanned with CamScanner Machine Leeming 5-20 i of Applications wy ie "Density-Based Spatial Clustering With Noes DBSCAN stands points. DBSCAN groups together closely-packed There are two inputs to DBSCAN + 1, The search distance around point (©) 2. The minimum number of points ey vase and “The 2 s ees ‘DBSCAN is based on this intuitive not eee that for each point of a cluster, the neighborhood of a given 28 1 a at least a minimum number of points: ; ‘The parameter (2) defines the radius of neighborhood ver a point x. It ig the eneighbarhood of x, The parameter MinPts is the minimum number neighbors within "eps" radius. ‘Any point x in the dataset, with a neighbor count greater than or equal to Mj, Rae point We say that x is border point, ifthe number of neighbors is less than MinPs, but it belongs to the Eneighborhood of some cot point z, Finally, if a point is nether a core nor a border point, then itis catia « noise point or an outlier. Fig. 54.1 shows DBSCAN. fs) required to form a density cag Fig. 5.4.1 DBSCAN We define following terms for understanding the DBSCAN algorithm : 1, Direct density reachable : A point "A" is directly density reachable from another point “B” if : "A" is in the Eneighborhood of "B” and "B" is a core point. 2. Density reachable : A point "A" is density reachable from "B" if there are a se of core points leading from "B" to "A. 3. Density connected : Two points "A" and "B” are density connected if there are a core point °C’, such that both "A" and B" are density reachable from "C° DBSCAN is broadly used in many applications such as market research, pattem Tecognition, data analysis, and image processing. TECHNICAL PUBLICATIONS® - an up-thnust for knowledge Unsupervised Leaming { ga Advantages and Disadvantages of DBSCAN intages * | 4) We don't need to specify the number of clusters, }) Hexbility in the shapes & sizes of clusters, «) Able to deal with noise and outliers. 4) Ability to identify uneven shapes. ¢) Itis easy for someone who knows the dataset, to set the parameters, Bere: __ 1, Input parameters may be dificult to determine 2, In some situations very sensitive to input parameter setting. 3, Can be confused when there is a border point that belongs to two clusters. 4, Results depend highly on the distance metric, 5. Can be hard to guess the correct parameters for an unknown dataset. . E2E) Spectral Clustering © Spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset + Given data points x;, ... xy, pairwise affinities Ay = A(x, x) © Build similarity graph shown in Fig. 542. © Clustering = Find a cut through the graph © Define a cut-type objective function. + The low-dimensional space is determined by the data. Spectral clustering makes Fig. 8.4.2 Similarity graph use of the spectrum of the graph for dimensionality reduction. + Projection and clustering equates to graph partition by different min-cut criteria. TECHNICAL PUBLICATIONS® - an up-hrust for knowiedge Scanned with CamScanner Uncupervise 5-22 porvised, rn ‘Machine Leaming oe the clusters. Ese 4. Does not make strong assumptions on the statistics 2. Easy to implement. 3. Good clustering results. ‘4 4, Reasonably fast for sparse dala sets of several thousand element Disadvantages : 1. May be sensitive to choice of parameters. 2. Computationaly expensive for large datasets. [EGF outlier Analysis © A database may contain data objects that do not comply with the general behaviour model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers. © Outlier detection is the process of detecting and subsequently excluding outliers from a given set of data. * Fig. 5.5.1 shows outliers detection. Here Fig. 5.5.1 Outliers detection O1 and O2 seem outliers from the rest. An outlier may be defined as a piece of data or observation that deviates drastically from the given norm or average of the data set. An outlier may be caused simply by chance, but it may also indicate measurement error or that the given data set has a heavy-tailed distribution. ‘© Objective : Define what data can be considered as inconsistent in a given data set + Outlier analysis is used in various types of dataset, such as graphical dataset, numerical data set, text dataset and can also be used on the pictures etc. * The identification of outlier can lead to the discovery of useful and meaningful knowledge. Outlier detection is the process of finding data objects with behaviours that are very different from expectation. Such objects are called ontlier or anomalies. * Finding outliers from as collection of pattem is a popular problem in the field of data mining. A key challenge with outlier analysis and detection is that it is not a well formulated problem like clustering so outlier detection as a branch of data mining requires more attention, TECHNICAL PUBLICATIONS® - an up-tmus or knowledge eaming P out pe and detection has various applications in numerous fields such as fraud detection, credit card, discoverin intrusi ; 7 l B computer intrusion and criminal pehaviours, medical and public health outli ion, i i ; ‘Siem ; jer detection, industrial damage + General idea of application is to find out data which devi normal hich behaviour of data set. on ae eee [BI Statistical Distribution-based Outlier Detection «The statistical approach assumes that data follows some standard or predefined distribution or probability model, and aims to identify outliers with respect to the model using a discordance test. «A discordance test is used to detect whether a given object is an outliers or not. + The general idea behind statistical methods for outlier detection is to lear a generative model fitting the given data set, and then identify those objects in Jow-probability regions of the model as outliers. «Statistical methods perform poorly on high-dimensional data. Statistical methods for outlier detection can be divided into two major categories. 1. Parametric methods : © Model using parametric technique grow only with model complexity not data size. ‘Assume that the normal data objects are generated by a parametric distribution. Regression, scatter-point method are popular parametric method. «This Fig, 55.2 shows scatter-point method to detecting outliers. x © Oulers Fig, 5.5.2 Scatter-point method to detecting outliers 2, Nonparametric methods : + The model of normal data is cone a priori. It do not make any + Kemel feature space are methods of non-parametric techniques. Density-based approach and deviation-based approach for numerical data Frequency based approaches have been defined to detect outliers in categorical data learned from the input data, rather than assuming y assumption about statistical distribution of data. TECHNICA! PUBLICATIONS® - an up-hust for knowledge Scanned with CamScanner e ith that identifies the outliers presen é Local Oates Fac (LOR) son S85 E peti llorbasc) depths ee Ela dws to defve ou ty ane density based scoring, It is similar tothe KNN (nearest neighbor search) alpory, @ * ‘The Local Ones Face (LOR) lgrien is em ensupervsed eromaly tection method which computes the local density deviation of a given data point 1.¢ Tespect to its neighbors. ; 5 The LOF algoritium can be used for outlier detection and novelty detection 7, diference between oar detection and novelty detection les in the training ataset. Outlier detection includes outliers in the training data Novelty detection only includes the normal data points when taining the moja The mel wl te tne es ith nn fo pedo Th in novelty detection are also called novelties * The local outlier factor method works by comparing the density with the relative densities of its neighbours. If a point is relatively neighbours, it is a potential outlier. I the ratio of densi density of point is too high, we end up with 2 high LOF + We first define the K-istawe of a point K-listane(A) = D; Teangg eighbour). Bs a . ie oe “Point Ny(A) = (P| Dist(A,F)<= K-distance(A)}. é * Reachability Distance : it expresses the maximum of the distance ‘points and the k-distance of the second point. The distance between the nts here Reachability Distance, (A,B) = max{k-distance(B), dist(A,B)) ay * The local reachability densities found are compared to the Tocal_ reacabilty densities of a's nearest k neighbors. The density of each neighbor is summed up and divided by the density of a The value found is divided by neighbors ie. k. (arpa *. “neighhor) + LRD(2"4 + LOF(a) = t ene = Unsupariised Learing gp Evaluation Metrics and Score ees gal Homogeneity + Homogeneity metric of a cluster labeling gi . 7 : Given a ground truth A clustering result satisfies homogenity if all of its clusters cota only data points whch members of a single class. a « This metric is independent of the absolute values of the labels : A permutation of the class or cluster label values won't change the score value in any way. To define the concepts of entropy H(X) and conditional entropy H(X|¥), which measures the uncertainty of X given the knowledge of Y. Therefore, ifthe class set is denoted as C and the cluster set as K, H(C|K) is a measure of the uncertainty in determining the right class after having clustered the dataset. To have a homogeneity score, it's necessary to normalize this value considering the initial entropy of the class set H(C) : H(CIK) HO) In scikitlearn, there's the builtin function homogeneity_score() that can be used to compute this value : from sklearn.metrics import homogeneity_score ” [EGE completeness + A-complementary requirement is that each sample belonging to a class is assigned to the same cluster. ‘A dustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster. «This metric is independent of the absolute values of the labels : A permutation of the class or cluster label values won't change the score value in any way. This measure can be determined using the conditional entropy H(K|C), which is the uncertainty in determining the right cluster given the knowledge of the class. Like for the homogeneity score, we need to normalize this using the entropy Hq): h = 1- We can compute this score (on the same datasel) using the function completeness_score() : from sklearn.metrics import completeness_score TECHNICAL PUBLICATIONS? = an up-thrust for knowledge Scanned with CamScanner Unsuper ne UPerviseg | Machine Leeming EEX] Adjusted Rand Index re - i ‘it tween the ori; © The adjusted rand index measures the similarity Fa lg Partitioning (Y) and the clustering. * If total number of samples in the dataset is a+b | neue () f objects that are either in th, | cen ie ee ae divided by the total number 3 ranneieead | * The Rand index lies between 0 and 1. | | n, the rand index is defined as , * When two partitions agree perfectly, the Rand index achieves the maximun, value 1. + A problem with Rend index is that the expected value of the Rand index between two random partitions is not 2 constant ; * This problem is corrected by the adjusted Rand index that assumes the generalizes +hyper-geometric distribution as the model of randomness. * The adjusted Rand index has the maximum value 1, and its expected value is 0 in the case of random clusters. * A larger adjusted Rand index means a higher agreement between two partitions, The adjusted Rand index is recommended for measuring agreement even when the Pattitions compared have different numbers of clusters. Silhouette * Silhouette refers to a method of interpretation and validation of clusters of data, | * Silhouettes are a general graphical aid for interpretation and validation of cluster | analysis. This technique is available through the silhouette function. In order to | calculate silhouettes, two types of data are needed : 1° The collection of all distances between objects. These distances are obtained | from application of dist function on the coordinates of the elements in mat , with argument method. | 2, The partition obtained by the application of a clustering technique. * Tor each element, a silhouette value is calculated and evaluates the degree of | confidence in the assignment of the element : 1. Well-clustered elements have a score near 1, | 2. Poorly-clustered elements have a score Near ~ 1, | TECHNICAL PUBLICATIONS? . an up-tiist for knowledge points, as well as clusters and clustering's, For an individual point, I a = Average distance of i to the points in the same cluster b = min (average distance of i to points in another cluster) Silhouette coefficient of i : Silhouette coefficient © Cohesion : Measures how closely related are objects in a cluster. © Separation : Measure how distinct or well-separated a cluster is from other clusters. [EGET Elbow method «The elbow method is used to determine the optimal number of clusters in kmeans clustering. The elbow method plots the value of the cost fiunction produced by different values of k. «If k increases, average distortion will decrease, each cluster will have fewer constituent instances, and the instances will be closer to their respective centroids. However, the improvements in average distortion will decline as k increases. The value of k at which improvement in distortion declines the most is called the elbow, at which we should stop dividing the data into further clusters. «It involves running the algorithm multiple times over a loop, with an increasing, number of cluster choice and then plotting a clustering score as a function of the number of clusters. + The Elbow and Silhouette methods are the two state-of-the-art methods used to identify the correct cluster number in the dataset. + ‘The Elbow method is the oldest method to distinguish the potential optimal cluster number for the analyzed dataset, whose basic idea is to specify K = 2 as the initial optimal cluster number K, and then keeps increasing K by step 1 f0 the TECHNICAL PUBLICATIONS® - an up-hrst for knowledge Scanned with CamScanner Scanned with CamScanner ‘Scanned with CamScanner Introduction to Neura Neb acti Lochs eae IGRI Antiticial Neural Networks Introduction to Neural Networks ivid i It implic “Neural * conjures up many vivid images. It implies dey, . a re ae be Toaded with science fiction overtones Rent ie = Frankenstein mythos. What are neural networks ? i i te Processing “nodes » * A neural network is an interconnected collection of discrete processing "nodes,» units, whose operation is somewhat analogous to that of an animal neuron, The interunit connection strengths, or weights, acquired through a process gy adaptation to, or learning from, a set of training patterns, are where the networ)., Processing power is kept. that make up the hy The estimated 10 (100 billion) nerve cells or neurons human brain are depicted in Fig. 61.1 in a highly stylised manner. Electrical signajs which are brief impulses or "spikes" in the voltage of the call wall or membiane, are used by neurons to communicate. Electrochemical junctions called synapses, which are found on cell branches known as dendrites, mediate the interneuron connections. he oo aes => Cell body = be Spalie — >> pw ee E> ouput Fig. 6.1.1 The fundamental elements of a neuron * Each neuron typically has thousands of connections to other neurons, which results in a continual influx of messages that finally reach the cell body. Here, they are combined or integrated in some manner, and the neuron will fire” or produce TECHNICAL PUBLICATIONS® - an up-thnust for ‘knowledge Leming Introduction to Neural Networks a voltage impulse in response if the resulting signal i si predetermined threshold. The axon, a fae i enna i ; communicate this to other neurons. : cone Neural networks with multiple modules and sparse connectivity between them are the best solution for many situations. There are numerous waYs to structure modules, iswluding hierarchical organization, successive refinement, input flexibility. 3 ae A network is a structure like a graph, while "neural" is an adjective for a neuron. The terms "artificial neural networks," "artificial neural systems,” "parallel distributed processing systems" and "connectionist systems" are also used to describe artificial neural networks. ‘ ‘A computer system must have a labeled directed graph structure with nodes that carry out certain basic operations in order for it to be referred to by these elegant labels. A "Directed Graph’ is made up of a collection of "nodes" (vertices) and a collection of "connections" (edges /links arcs) that join up pairs of nodes. ‘A graph is referred to as a “labeled graph” if each link has a label to define a connection attribute. X,e(0.1} | [axon (0 =X, AND X, Fig. 6.1.2 AND Gate graph + Since the connections between the nodes are fixed and appear to have no other purpose than to transport the inputs to the node that computed their conjunction, this Fig. 6.2 graph cannot be regarded as a neural network, TECHNICAL PUBLICATIONS® - an up-ust for knowtede 5 Introduction to Metmanig EO (=X, AND Xp Xe Fig, 6.1.3 AND Gate network j + Fig. 613, the computing system meets the definition of an artifici | network since it has a graph structure that connects weights that can be net using @ learning algorithm. = FEET Biological Neuron Model There are four parts of the typical nerve cell shown in Fig. 6.14 1. Dendrites : Accepts the inputs 2. Soma : Process the mputs. Basic 3. Axon : Turns the processed inputs into output 4. Synapses : The electrochemical contact between the neurons B 3. 4. 5 3, These products are then simy added 0g futon, {0 to puke ee 1 Fee rong eo output. a O wy, % mz O39 — 03,509 x— OY" Fig, 6.1.5 Artificial neuron model Terminology Biological terminology : ANN terminology ‘Neuron : Neurode or cell or unut or node Synapse : Edge or connection or link Synaptic efficiency : Weight or connection strength Firing frequency : Node output ial Neural Network (ANN) uction : ‘An all-encompassing, useful technique for learning real-valued, discrete-valued, and vector-valued functions from examples is provided by artificial neural networks (ANNs). Gradient descent is used by algorithms like backpropagation fo adjust network parameters to best suit a training set of input-output pairs. ANN learning has been effectively used to solve issues including uderstanding visual sceneries, speech recognition, and learning robot control strategies because it is robust to faults in the training data “Arbficial neural neworks (ANNs) are computer programs that aim to address any issue by imitating the composition and operation of the nervous system Gimulated neurons serve 28 the foundation of neural networks, which are constructed in a variety of ways. In the following two respects, neural networks and the human brain ere similar : A neural network learns new information. © The synaptic weight, a measure of the connectvity strengtle, is where & soik stores ite Lnowledge noural Scanned with CamScanner Machine Leaming Fig, 6.4.6 Biological neural network © The biological neural networks that shape te stucture See brain ary Where the pluase "artificial neural network” originates. Artificial noural eto also feature neurons that are interconnected to one another in different lovel, of the networks, much like the Auman brain, which has neurons thay interconnected to one another. Nodes are the name for these neurons. ‘© The biological neural network's typical diagram is shown in Fig. 6.1.6, + The given Fig. 6.1.7 represents a typical artificial newal snctwork, ee Fig. 6.1.7 Artificial neural network | © In artificial neural networks, dendrites from Biological neural networks serve as inputs, cell nuclei serve as nodes, synapses serve as weights and axons sorve as outputs. . * Arnficial and biological neural networks are related to | Biological eural network Artificial neural network Dendiites 3 TECHNICAL PUBLICATIONS® . an up-thrust for knowledge Learning. 6-7 ee re Astificial neural networks are used in artificial i r in artificial intelligence to sit of neurons that make up the human brain, giving ee kee comprehend information and make decisions in a manner similar fae person. Computers are programmed to act simply like interconnected brai le . create an artificial neural network. ae + Consider ait example of a digital logic i gate that accepts input and we may better grasp the artificial neural network. ne aa © Two inputs are required for the "OR" gate. If ei i are "On," the output will also be’On” Saal If both inputs are “Off the output wi oa dependent on input. seen eo Rca eee aaa Our brains do not carry out the same function. Because our brain's neuréns are constantly "learning," the relationship between outputs and inputs is constantly changing [RED Architecture of ANN Model om] © © [Hidden layer 4} a Fig, 6.1.8 Architecture of Artificial Neural Network 4. Input layer + As its name implies, the input layer accepts inputs from the programmer in a variety of different formats. 2, Hidden layer : The hidden layer is displayed between the input and output layers. It makes all the computations necessary to uncover pattems and buried features. 3, Output layer : This layer is used to communicate the output after the input has undergone a number of alterations in the hidden layer. TECHNICAL PUBLICATIONS® = an up-thrust for knowledge Scanned with CamScanner oe es Introduction to Nou Belek SET 1 rr. ae th artificial neural network computes the Weighted gy eae bias. A transfer function 5 used t0 vii 9 inputs an computation. 2, yy Wy*Xi +b fal ' it passes the weighted total as an inpyp to produce the output, it passes the i to ae a ‘A node’ activation functions determine whether gy should fire, The output layer is only accessible to individuals who are fired. Depending gy type of task we are completing, there are many activation functions that gay used. [ERY Advantages and Disadvantages of ANN ificial Neural Network (ANN) lee Nei ees capability : Artificial neural networks have a numerical ya that allows them to carry out multiple tasks at once. . Storing data on the entize network : Traditional programming does not employ database; instead, it stores data on the entire network. The network continue. function even if some data disappears from one location temporarily. . Capability to work with incomplete knowledge : After ANN training, the da may sill produce output even with insufficient data. The relevance of the mis data in this situation is what causes the performance loss. Having a memory distribution : Determining the instances and motivating ty network: in accordance with the intended output by showing it these examples crucial for ANN to be able to adapt. The network's output can be false if i event can't be represented by the network in all of its characteristics because the network's succession is directly proportional to the selected occurrences, 5. Having fault tolerance : The network is fault-tolerant since expropriation of one of more ANN cells does not prevent the network from producing output, Disadvantages of Artificial Neural Network : 1. Assurance of proper network structure : The construction of artifical news networks is not determined by any specific rules. Through experience, tral and error, the right network structure is achieved, 2 Unrecognized behaviour of the network : It is the most important ANN issue When an ANN generates a testing solution, it doesn't explain why ot how. lt erodes network confidence. 5. Hardware dependence : According to their structure, artificial neural networks require processors with parallel processing power. As a result, the equipments realisation is dependant. TECHNICAL PUBLICATIONS® . ‘an up-thrust for knowledge yma __69 rset 4, Difficulty of showing the issue to the network : ANNs can Process data that is numerical. Before using ANN, problems must be transformed into numerical yalues. The network's performance will be directly impacted by the presentation mechanism that must be decided here. It is dependent on the user's skills. 5, The duration of the network is unknown : The network is reduced to a particular error value, and this error value does not produce the best outcomes for us 4 Science artificial neural networks, which first appeared in the world in the middle ' of the 20th century, are growing quickly. Currently, we have looked into the benefits of artificial neural networks and the problems that can arise when using he them. It should not be forgotten that the disadvantages of ANN networks, a be burgeoning scientific field, are being eradicated one at a time as their advantages are growing. It implics that artificial neural networks will increasingly play a crucial role in our lives and become indispensable, [EEE Working of ANN * How do artificial neural networks work 7 « The ideal way to visualise an artificial neural network is as a weighted directed graph, where the nodes are the artificial neurons. The directed edges with weights represent the relationship between the neuron. ae [F# inputs and outputs ‘The input signal for the artifical neural network comes from an external source as a pattern and an image as a vector. Then, for each n-th input, these inputs are mathematically assigned using the notation x(n). Tpasigrais > Fig, 6.1.9 Working of ANN TECHNIC’, PUBLICATIONS® - an upsthrust for knowledge Scanned with CamScanner 10 Introduction 1 Ney pel Trom the above Fig. 6.1.9, ene eee details utilised by the artificial neural networ} a mi) + In the artificial neural network, these weights often indicate how well ney connected to one another. Inside the computer unit, a summary of each y, input is created. aie * Jf the weighted total is equal to zero, the output is rendered non-zero by bias; otherwise, something else is done to scale up the output to the reaction. The weight for bias is 1 and the input is the same. © The sum of the weighted inputs in this case can range from 0 to positive i Here, a certain maximum value is benchmarked fo maintain the response the bounds of the intended value and the sum of the weighted inputs through the activation function. | * The set of transfer functions utilised to produce the desired output is refer, as the activation function. * A variety of activation functions exist, although they are mainly either ling non-linear sets of functions. The binary, linear and tan hyperbolic sign activation function sets are a few of the often employed sets of actiy functions. Let's examine each of these in more detail : | * Binary : The output of a binary activation function is either a one or a zero, a threshold value has been established in order to achieve this. The final outp, the activation function is returned as one or 0 depending on whether the weighted input of neurons is greater than 1. Sigmoidal hyperbolic : Most people think of the sigmoidal hyperbolic functio a "S" curve. Here, the output from the actual net input is approximated using tan hyperbolic function. The definition of the function is, FO) = (1/1 + exp( 272?) Types of ANN * Artificial neural networks (ANN) come in a variety of forms, and they all ca Out tasks in a way that is comparable to how human brain ne functions. Most artificial neural networks will share certain characteristics with biological counterpart that is more complicated, and they are quite good at wh they are meant to do segmentation or categorization, as examples. 1. Feedback ANN : In this kind of ANN, the Output loops back into the networ fo achieve the best intemally evolved results. The feedback networks a uron and netwo TECHNICAL PUBLICATIONS® - an upthrust for knowledge ar or vidal ation it of net Las Leeming ett weit ‘ntroducton to Neural Networks excellent for addressing optimization jinformat ub : back into themselves, Utilizing feeds ANE oe ie ee poe ls, the internal system error 2, Feed-Forward ANN : A feed-forward network is a type of neural network that consists of at least one layer of neurons as well as input and output layers. The network's intensity can be observed based on the collective behaviour of the connected neurons, and the output is chosen by evaluating the network's output in the context of its input, The main benefit of this network is that it learns to assess and identify input patterns. pu ANN Model + A mathematical function designed as a basic representation of a real (biological) neuron is called an artificial neuron. 1. The McCulloch-Pitts neuron : A threshold logic unit, a simplified representation of actual neurons, is used here. 2. The activations of other neurons are brought in by a network of input connections. 3. The inputs are added up by a processing unit, which then uses a nonlinear activation function (such as a squashing, transfer, or threshold function) to activate the system. 4, Other neurons receive the information from an output line. Basic elements of ANN + Three fundamental elements make up a neuron : Weights, thresholds and a single activation function. Fig. 6.1.10 depicts an Artificial Neural Network (ANN) model based on Liclogical brain systems. ANN model shawn in Fig: 6.1.11 and neural network adjustment is shown in Fig, 6.1.12 xO Activation function ‘Synaptic weights Fig, 6.1.10 Basic elements of artificial neural network TECHNICAL PUBLICATIONS - an up-hrst for knowledge Scanned with CamScanner Machine Leeming 6-14 tnt NU Noten, Input Fig. 6.1.17 Solution of an example EESED Applications of ANN 1. Social media : Social media makes extensive use of artificial neural networks. Take | Facebook's "People we may know” tool, for instance, which advises users to senq | friend requests based on potential familiarity. The individuals we could know are } determined by employing Artificial Neural Networks, which examine our profile, Interests, existing friends, their friends, and several other characteristics jg determine who we might know. Facial recognition is a typical use of machine learning in social media. Convolutional neural networks are used to locate approximately 100 reference points on the subject's face and then compare them to points already present in the database. 2 Sales and marketing : Based on our previous browsing behaviour, e-commerce sites like Amazon and Flipkart will suggest things to us when we log in. Similar to how Zomato, Swiggy, etc. will present restaurant suggestions based on our Preferences and previous order history if we love pasta. It is done by using individualised marketing, which is true across all new-age marketing categories, including book sites, movie services, hospitality sites, etc. The marketing efforts are then customised in accordance with the customer's preferences, dislikes, Previous purchases, etc. using artificial neural networks. 3. Healthcare : Oncology uses artificial neural networks to train algorithms that can recognise malignant tissue at the microscopic level with the same accuracy as skilled medical professionals. Using facial analysis on the images of the patients, certain rare diseases that can appear physically can be detected in their early stages. Therefore, the widespread adoption of artificial neural networks in the healthcare sector can only increase the diagnostic skills of healthcare professionals and, in the long run, raise the standard of healthcare globally. 4. Personal assistants : We've all heard of Siri, Alexa, Cortana, and other virtual assistants thanks to our smartphones! These are personal assistants that employ speech recognition and natural language processing to converse with their users and create responses in line with their needs, Artificial neural networks are used in natural language processing to manage many of these personal assistants’ TECHNICAL PUBLICATIONS® - an up-thrust for knowledge 2 Whe see nk? Dif 238 WB, 2, Write a detailed note on ANN architecture. : 3._Discuss ANN and its merits and demerits. : ——_ [2 Single Layer Neural Network + The simplest type of neural network is a single-layer neural network, which has !npus just one layer of input nodes and only one layer of receiving nodes, or in some circumstances, just one receiving node, sending —_ weighted inputs to each other. + This singlelayer structure served as the cumerstone for later, more complex systems. + The term ‘perceptron” refers to one of the earliest types of single-layer neural networks. A function based on inputs, once more based on single neurons in the physiology of the human brain, would be retumed by the perceptron. + Perceptron models resemble "logic gates” that perform specific functions in some ways : Depending on the weighted inputs, a perceptron will either deliver a signal or not. The single-layer binary linear classifier is a different kind of single-layer neural network that can divide inputs into one of two groups. © One way to think of single-layer neural networks is as a subset of feedforward neural networks, in which data only flows from the inputs to the output in one direction. Again, this distinguishes these basic networks from far more complex systems, such as ones that operate through gradient descent or backpropagation. Input layer ‘Output layer Fig. 6.2.1 Single layer feedforward neural network TECHNICAL PUBLICATIONS® - an upthrust for inowledge 6-16 Introduction to Ne Meche teamby 8218 Perceptron Model ‘Simple Perceptron for Pattern Classification p ; © Pattern classification into two or more categories can be done using networks. The perceptron leaming rule is used to train the perceptr moving on fo the broad multiclass classification, we will first tJ classification into two categories. , * All that is required for classification into just two categories 1S a sin neuron. Here, bipolar neurons will be used. The simplest architecture do the task is one input layer, one output layer with N neurons, and layers. + For the oulput neurons, we will utilise a different transfer function, a the equation 1 below. A single layer perceptron network is shown in Fig Equation 1: 1 if y > y= {0 if -O

You might also like