|
2 | 2 | "cells": [
|
3 | 3 | {
|
4 | 4 | "cell_type": "markdown",
|
5 |
| - "id": "0f289675", |
| 5 | + "id": "e61ebd40", |
6 | 6 | "metadata": {},
|
7 | 7 | "source": [
|
8 | 8 | "<h3>NLP Tutorial: Text Classification Using Spacy Word Embeddings</h3>"
|
9 | 9 | ]
|
10 | 10 | },
|
11 | 11 | {
|
12 | 12 | "cell_type": "markdown",
|
13 |
| - "id": "d66b56e0", |
| 13 | + "id": "9b6e2a9d", |
14 | 14 | "metadata": {},
|
15 | 15 | "source": [
|
16 | 16 | "#### Problem Statement\n",
|
|
41 | 41 | {
|
42 | 42 | "cell_type": "code",
|
43 | 43 | "execution_count": 4,
|
44 |
| - "id": "16363902", |
| 44 | + "id": "5e5e94c2", |
45 | 45 | "metadata": {},
|
46 | 46 | "outputs": [
|
47 | 47 | {
|
|
136 | 136 | {
|
137 | 137 | "cell_type": "code",
|
138 | 138 | "execution_count": 5,
|
139 |
| - "id": "a3b3c28e", |
| 139 | + "id": "62809942", |
140 | 140 | "metadata": {},
|
141 | 141 | "outputs": [
|
142 | 142 | {
|
|
159 | 159 | },
|
160 | 160 | {
|
161 | 161 | "cell_type": "markdown",
|
162 |
| - "id": "045db059", |
| 162 | + "id": "7134e686", |
163 | 163 | "metadata": {},
|
164 | 164 | "source": [
|
165 | 165 | "From the above, we can see that the labels (classes) occur almost equally often, so the dataset is balanced. There is no class-imbalance problem, and hence no need to apply balancing techniques such as undersampling or oversampling."
|
|
168 | 168 | {
|
169 | 169 | "cell_type": "code",
|
170 | 170 | "execution_count": 6,
|
171 |
| - "id": "3522310e", |
| 171 | + "id": "cbc8320c", |
172 | 172 | "metadata": {},
|
173 | 173 | "outputs": [
|
174 | 174 | {
|
|
256 | 256 | },
|
257 | 257 | {
|
258 | 258 | "cell_type": "markdown",
|
259 |
| - "id": "9a247477", |
| 259 | + "id": "db288e82", |
260 | 260 | "metadata": {},
|
261 | 261 | "source": [
|
262 | 262 | "**Get spacy word vectors and store them in a pandas dataframe**"
|
|
265 | 265 | {
|
266 | 266 | "cell_type": "code",
|
267 | 267 | "execution_count": null,
|
268 |
| - "id": "5a26a0f7", |
| 268 | + "id": "d09b11d0", |
269 | 269 | "metadata": {},
|
270 | 270 | "outputs": [],
|
271 | 271 | "source": [
|
|
276 | 276 | {
|
277 | 277 | "cell_type": "code",
|
278 | 278 | "execution_count": 7,
|
279 |
| - "id": "135943df", |
| 279 | + "id": "c80141f0", |
280 | 280 | "metadata": {},
|
281 | 281 | "outputs": [],
|
282 | 282 | "source": [
|
|
287 | 287 | {
|
288 | 288 | "cell_type": "code",
|
289 | 289 | "execution_count": 39,
|
290 |
| - "id": "bd570f99", |
| 290 | + "id": "e07897f7", |
291 | 291 | "metadata": {},
|
292 | 292 | "outputs": [
|
293 | 293 | {
|
|
385 | 385 | {
|
386 | 386 | "cell_type": "code",
|
387 | 387 | "execution_count": 42,
|
388 |
| - "id": "602ea3c4", |
| 388 | + "id": "84f8b618", |
389 | 389 | "metadata": {},
|
390 | 390 | "outputs": [],
|
391 | 391 | "source": [
|
|
402 | 402 | {
|
403 | 403 | "cell_type": "code",
|
404 | 404 | "execution_count": 47,
|
405 |
| - "id": "d3cbdab6", |
| 405 | + "id": "e8d475c0", |
406 | 406 | "metadata": {},
|
407 | 407 | "outputs": [],
|
408 | 408 | "source": [
|
|
415 | 415 | {
|
416 | 416 | "cell_type": "code",
|
417 | 417 | "execution_count": 51,
|
418 |
| - "id": "a5d3a48c", |
| 418 | + "id": "53b6072f", |
419 | 419 | "metadata": {},
|
420 | 420 | "outputs": [
|
421 | 421 | {
|
|
449 | 449 | {
|
450 | 450 | "cell_type": "code",
|
451 | 451 | "execution_count": 52,
|
452 |
| - "id": "8fa685ad", |
| 452 | + "id": "0e074362", |
453 | 453 | "metadata": {},
|
454 | 454 | "outputs": [
|
455 | 455 | {
|
|
477 | 477 | {
|
478 | 478 | "cell_type": "code",
|
479 | 479 | "execution_count": 53,
|
480 |
| - "id": "c6b36c17", |
| 480 | + "id": "46c78b8f", |
481 | 481 | "metadata": {
|
482 | 482 | "scrolled": true
|
483 | 483 | },
|
|
516 | 516 | },
|
517 | 517 | {
|
518 | 518 | "cell_type": "markdown",
|
519 |
| - "id": "aa8b852f", |
| 519 | + "id": "4e8bb2b8", |
520 | 520 | "metadata": {},
|
521 | 521 | "source": [
|
522 | 522 | "**Confusion Matrix**"
|
523 | 523 | ]
|
524 | 524 | },
|
525 | 525 | {
|
526 | 526 | "cell_type": "code",
|
527 |
| - "execution_count": null, |
528 |
| - "id": "80bc0ff6", |
| 527 | + "execution_count": 55, |
| 528 | + "id": "e54d8240", |
529 | 529 | "metadata": {},
|
530 |
| - "outputs": [], |
| 530 | + "outputs": [ |
| 531 | + { |
| 532 | + "data": { |
| 533 | + "text/plain": [ |
| 534 | + "Text(69.0, 0.5, 'Truth')" |
| 535 | + ] |
| 536 | + }, |
| 537 | + "execution_count": 55, |
| 538 | + "metadata": {}, |
| 539 | + "output_type": "execute_result" |
| 540 | + }, |
| 541 | + { |
| 542 | + "data": { |
| 543 | + "image/png": "\n", |
| 544 | + "text/plain": [ |
| 545 | + "<Figure size 720x504 with 2 Axes>" |
| 546 | + ] |
| 547 | + }, |
| 548 | + "metadata": { |
| 549 | + "needs_background": "light" |
| 550 | + }, |
| 551 | + "output_type": "display_data" |
| 552 | + } |
| 553 | + ], |
531 | 554 | "source": [
|
532 | 555 | "#finally print the confusion matrix for the best model\n",
|
533 | 556 | "from sklearn.metrics import confusion_matrix\n",
|
|
541 | 564 | "plt.xlabel('Prediction')\n",
|
542 | 565 | "plt.ylabel('Truth')"
|
543 | 566 | ]
|
| 567 | + }, |
| 568 | + { |
| 569 | + "cell_type": "markdown", |
| 570 | + "id": "e6320b77", |
| 571 | + "metadata": {}, |
| 572 | + "source": [ |
| 573 | + "#### Key Takeaways\n", |
| 574 | + "\n", |
| 575 | + "1. The KNN model, which did not perform well with vectorization techniques such as Bag of Words and TF-IDF because of their very **high-dimensional vector space**, performed really well with GloVe vectors, which have only **300 dimensions** and provide good embeddings (similar and related words have similar embeddings) for the given text data.\n", |
| 576 | + "\n", |
| 577 | + "2. The MultinomialNB model performed decently but did not make the top of the list, because the 300-dimensional vectors contain negative values. MultinomialNB cannot be fit on data that contains **negative values**, so we used the **Min-Max scaler** to rescale all values to the range 0 to 1. This rescaling can lose some variance and information in the data; even so, we got decent recall and F1 scores." |
| 578 | + ] |
544 | 579 | }
|
545 | 580 | ],
|
546 | 581 | "metadata": {
|
|
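The second takeaway in the added "Key Takeaways" cell (MultinomialNB cannot be fit on negative values, so the vectors are rescaled with a Min-Max scaler) can be illustrated with a minimal sketch. This is not the notebook's own code: the random vectors below are a synthetic stand-in for the 300-dimensional spaCy vectors, and the names `X`, `y`, `raw_fit_failed`, and `model` are assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for 300-dimensional document vectors (assumption:
# real embeddings, like these, contain both positive and negative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))
y = (X[:, 0] > 0).astype(int)  # hypothetical binary labels

# Fitting MultinomialNB directly on data with negative values raises an error.
try:
    MultinomialNB().fit(X, y)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True

# Rescaling every feature to [0, 1] first makes the data acceptable to the model.
model = make_pipeline(MinMaxScaler(), MultinomialNB())
model.fit(X, y)
predictions = model.predict(X)
```

Putting the scaler and the classifier in one pipeline means the scaling ranges are learned only from the data passed to `fit`, which avoids leaking test-set statistics when the same object is later evaluated on held-out data.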