diff --git a/10_stop_words/stop_words_exercise.ipynb b/10_stop_words/stop_words_exercise.ipynb new file mode 100644 index 0000000..28680ed --- /dev/null +++ b/10_stop_words/stop_words_exercise.ipynb @@ -0,0 +1,234 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b17f58fa", + "metadata": { + "id": "b17f58fa" + }, + "source": [ + "### **Stop Words: Exercise**" + ] + }, + { + "cell_type": "markdown", + "id": "23a26def", + "metadata": { + "id": "23a26def" + }, + "source": [ + "- **Run this cell to import all necessary packages**" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "34f02550", + "metadata": { + "id": "34f02550" + }, + "outputs": [], + "source": [ + "#import spacy and load the model\n", + "\n", + "import spacy\n", + "nlp = spacy.load(\"en_core_web_sm\")" + ] + }, + { + "cell_type": "markdown", + "id": "0fe230d8", + "metadata": { + "id": "0fe230d8" + }, + "source": [ + "**Exercise1:** \n", + "- From a Given Text, Count the number of stop words in it.\n", + "- Print the percentage of stop word tokens compared to all tokens in a given text." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "646c9e7a", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "646c9e7a", + "outputId": "7d59339e-bf53-4239-eda5-134e6af42e22" + }, + "outputs": [], + "source": [ + "text = '''\n", + "Thor: Love and Thunder is a 2022 American superhero film based on Marvel Comics featuring the character Thor, produced by Marvel Studios and \n", + "distributed by Walt Disney Studios Motion Pictures. It is the sequel to Thor: Ragnarok (2017) and the 29th film in the Marvel Cinematic Universe (MCU).\n", + "The film is directed by Taika Waititi, who co-wrote the script with Jennifer Kaytin Robinson, and stars Chris Hemsworth as Thor alongside Christian Bale, Tessa Thompson,\n", + "Jaimie Alexander, Waititi, Russell Crowe, and Natalie Portman. In the film, Thor attempts to find inner peace, but must return to action and recruit Valkyrie (Thompson),\n", + "Korg (Waititi), and Jane Foster (Portman)—who is now the Mighty Thor—to stop Gorr the God Butcher (Bale) from eliminating all gods.\n", + "'''\n", + "\n", + "#step1: Create the object 'doc' for the given text using nlp()\n", + "\n", + "\n", + "\n", + "#step2: define the variables to keep track of stopwords count and total words count\n", + "\n", + "\n", + "\n", + "#step3: iterate through all the words in the document\n", + "\n", + "\n", + "\n", + "#step4: print the count of stop words\n", + "\n", + " \n", + "\n", + "#step5: print the percentage of stop words compared to total words in the text\n" + ] + }, + { + "cell_type": "markdown", + "id": "vsJaC5a-ldY-", + "metadata": { + "id": "vsJaC5a-ldY-" + }, + "source": [ + "**Exercise2:** \n", + "\n", + "- Spacy default implementation considers **\"not\"** as a stop word. But in some scenarios removing 'not' will completely change the meaning of the statement/text. For Example, consider these two statements:\n", + "\n", + " - this is a good movie ----> Positive Statement\n", + " - this is not a good movie ----> Negative Statement\n", + "\n", + "- So, after applying stopwords to those 2 texts, both will return **\"good movie\"** and does not respect the polarity/sentiments of text.\n", + " \n", + "- Now, your task is to remove this stop word **\"not\"** in spaCy and help in distinguishing the texts.\n", + "\n", + "\n", + "- **Hint:** GOOGLE IT! Google is your friend.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "4e9e663a", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4e9e663a", + "outputId": "72779ead-6cb9-4f92-da54-3e3a882c2069" + }, + "outputs": [], + "source": [ + "#use this pre-processing function to pass the text and to remove all the stop words and finally get the cleaned form\n", + "def preprocess(text):\n", + " doc = nlp(text)\n", + " no_stop_words = [token.text for token in doc if not token.is_stop]\n", + " return \" \".join(no_stop_words) \n", + "\n", + "\n", + "#Step1: remove the stopword 'not' in spacy\n", + "\n", + "\n", + "\n", + "#step2: send the two texts given above into the pre-process function and store the transformed texts\n", + "\n", + "\n", + "\n", + "#step3: finally print those 2 transformed texts\n" + ] + }, + { + "cell_type": "markdown", + "id": "RWnHxZy-Fv5S", + "metadata": { + "id": "RWnHxZy-Fv5S" + }, + "source": [ + "**Exercise3:** \n", + "\n", + "- From a given text, output the **most frequently** used token after removing all the stop word tokens and punctuations in it. \n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "GfLMTZmBFlPI", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "GfLMTZmBFlPI", + "outputId": "448095a9-954b-43e9-ad86-da7d48aed72c" + }, + "outputs": [], + "source": [ + "text = ''' The India men's national cricket team, also known as Team India or the Men in Blue, represents India in men's international cricket.\n", + "It is governed by the Board of Control for Cricket in India (BCCI), and is a Full Member of the International Cricket Council (ICC) with Test,\n", + "One Day International (ODI) and Twenty20 International (T20I) status. Cricket was introduced to India by British sailors in the 18th century, and the \n", + "first cricket club was established in 1792. India's national cricket team played its first Test match on 25 June 1932 at Lord's, becoming the sixth team to be\n", + "granted test cricket status.\n", + "'''\n", + "\n", + "\n", + "#step1: Create the object 'doc' for the given text using nlp()\n", + "\n", + "\n", + "\n", + "#step2: remove all the stop words and punctuations and store all the remaining tokens in a new list\n", + "\n", + "\n", + "\n", + "#step3: create a new dictionary and get the frequency of words by iterating through the list which contains stored tokens \n", + "\n", + "\n", + "\n", + "#step4: get the maximum frequency word\n", + "\n", + "\n", + "\n", + "#step5: finally print the result\n" + ] + }, + { + "cell_type": "markdown", + "id": "ByUPtcy9EsCn", + "metadata": { + "id": "ByUPtcy9EsCn" + }, + "source": [ + "## [**Solution**](./stop_words_exercise_solutions.ipynb)" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "stop_words_exercise_solutions.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/10_stop_words/stop_words_exercise_solutions.ipynb b/10_stop_words/stop_words_exercise_solutions.ipynb new file mode 100644 index 0000000..20eac48 --- /dev/null +++ b/10_stop_words/stop_words_exercise_solutions.ipynb @@ -0,0 +1,270 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b17f58fa", + "metadata": { + "id": "b17f58fa" + }, + "source": [ + "### **Stop Words: Solutions**" + ] + }, + { + "cell_type": "markdown", + "id": "23a26def", + "metadata": { + "id": "23a26def" + }, + "source": [ + "- **Run this cell to import all necessary packages**" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "34f02550", + "metadata": { + "id": "34f02550" + }, + "outputs": [], + "source": [ + "#import spacy and loading the model\n", + "\n", + "import spacy\n", + "nlp = spacy.load(\"en_core_web_sm\")" + ] + }, + { + "cell_type": "markdown", + "id": "0fe230d8", + "metadata": { + "id": "0fe230d8" + }, + "source": [ + "**Exercise1:** \n", + "- From a Given Text, Count the number of stop words in it.\n", + "- Print the percentage of stop word tokens compared to all tokens in a given text." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "646c9e7a", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "646c9e7a", + "outputId": "7d59339e-bf53-4239-eda5-134e6af42e22" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Total Stop words presented in the given text: 40\n", + "Percentage of Stop words presented in the given text: 25.0 %\n" + ] + } + ], + "source": [ + "text = '''\n", + "Thor: Love and Thunder is a 2022 American superhero film based on Marvel Comics featuring the character Thor, produced by Marvel Studios and \n", + "distributed by Walt Disney Studios Motion Pictures. It is the sequel to Thor: Ragnarok (2017) and the 29th film in the Marvel Cinematic Universe (MCU).\n", + "The film is directed by Taika Waititi, who co-wrote the script with Jennifer Kaytin Robinson, and stars Chris Hemsworth as Thor alongside Christian Bale, Tessa Thompson,\n", + "Jaimie Alexander, Waititi, Russell Crowe, and Natalie Portman. In the film, Thor attempts to find inner peace, but must return to action and recruit Valkyrie (Thompson),\n", + "Korg (Waititi), and Jane Foster (Portman)—who is now the Mighty Thor—to stop Gorr the God Butcher (Bale) from eliminating all gods.\n", + "'''\n", + "\n", + "#step1: Create the object 'doc' for the given text using nlp()\n", + "doc = nlp(text)\n", + "\n", + "\n", + "#step2: define the variables to keep track of stopwords count and total words count\n", + "stop_words_count = 0\n", + "total_words_count = 0\n", + "\n", + "\n", + "#step3: iterate through all the words in the document\n", + "for token in doc:\n", + " if token.is_stop: #check whether given token is stop word or not and increment accordingly \n", + " stop_words_count += 1\n", + " total_words_count += 1 #increment the total_words_count\n", + "\n", + "\n", + "#step4: print the count of stop words\n", + "print(f\"Total Stop words presented in the given text: {stop_words_count}\")\n", + " \n", + "\n", + "#step5: print the percentage of stop words compared to total words in the text\n", + "percentage_stop_words = (stop_words_count / total_words_count) * 100\n", + "print(f\"Percentage of Stop words presented in the given text: {percentage_stop_words} %\")" + ] + }, + { + "cell_type": "markdown", + "id": "vsJaC5a-ldY-", + "metadata": { + "id": "vsJaC5a-ldY-" + }, + "source": [ + "**Exercise2:** \n", + "\n", + "- Spacy default implementation considers **\"not\"** as a stop word. But in some scenarios removing 'not' will completely change the meaning of the statement/text. For Example, consider these two statements:\n", + "\n", + " - this is a good movie ----> Positive Statement\n", + " - this is not a good movie ----> Negative Statement\n", + "\n", + "- So, after applying stopwords to those 2 texts, both will return **\"good movie\"** and does not respect the polarity/sentiments of text.\n", + " \n", + "- Now, your task is to remove this stop word **\"not\"** in spaCy and help in distinguishing the texts.\n", + "\n", + "\n", + "- **Hint:** GOOGLE IT! Google is your friend.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4e9e663a", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4e9e663a", + "outputId": "72779ead-6cb9-4f92-da54-3e3a882c2069" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Text1: good movie\n", + "Text2: not good movie\n" + ] + } + ], + "source": [ + "#use this pre-processing function to pass the text and to remove all the stop words and finally get the cleaned form\n", + "def preprocess(text):\n", + " doc = nlp(text)\n", + " no_stop_words = [token.text for token in doc if not token.is_stop]\n", + " return \" \".join(no_stop_words) \n", + "\n", + "\n", + "#Step1: remove the stopword 'not' in spacy\n", + "nlp.vocab['not'].is_stop = False\n", + "\n", + "\n", + "#step2: send the two texts given above into the pre-process function and store the transformed texts\n", + "positive_text = preprocess('this is a good movie')\n", + "negative_text = preprocess('this is not a good movie')\n", + "\n", + "\n", + "#step3: finally print those 2 transformed texts\n", + "print(f\"Text1: {positive_text}\")\n", + "print(f\"Text2: {negative_text}\")" + ] + }, + { + "cell_type": "markdown", + "id": "RWnHxZy-Fv5S", + "metadata": { + "id": "RWnHxZy-Fv5S" + }, + "source": [ + "**Exercise3:** \n", + "\n", + "- From a given text, output the **most frequently** used token after removing all the stop word tokens and punctuations in it. \n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "GfLMTZmBFlPI", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "GfLMTZmBFlPI", + "outputId": "448095a9-954b-43e9-ad86-da7d48aed72c" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Maximum frequency word: India\n" + ] + } + ], + "source": [ + "text = ''' The India men's national cricket team, also known as Team India or the Men in Blue, represents India in men's international cricket.\n", + "It is governed by the Board of Control for Cricket in India (BCCI), and is a Full Member of the International Cricket Council (ICC) with Test,\n", + "One Day International (ODI) and Twenty20 International (T20I) status. Cricket was introduced to India by British sailors in the 18th century, and the \n", + "first cricket club was established in 1792. India's national cricket team played its first Test match on 25 June 1932 at Lord's, becoming the sixth team to be\n", + "granted test cricket status.\n", + "'''\n", + "\n", + "\n", + "#step1: Create the object 'doc' for the given text using nlp()\n", + "doc = nlp(text)\n", + "\n", + "\n", + "#step2: remove all the stop words and punctuations and store all the remaining tokens in a new list\n", + "remaining_tokens = []\n", + "for token in doc:\n", + " if token.is_stop or token.is_punct: #check whether a given token is stop word or punctuations\n", + " continue\n", + " remaining_tokens.append(token.text)\n", + "\n", + "\n", + "#step3: create a new dictionary and get the frequency of words by iterating through the list which contains stored tokens \n", + "frequency_tokens = {}\n", + "for token in remaining_tokens:\n", + " if token != '\\n' and token != ' ': #As spacy considers new line and empty spaces as seperate token, it's better to ignore them\n", + " if token not in frequency_tokens: #if a particular token occurs for the first time, we initialise it to 1\n", + " frequency_tokens[token] = 1\n", + " else:\n", + " frequency_tokens[token] += 1 #if a partcular token is already present, then increment by 1 based on value already presented\n", + "\n", + "\n", + "#step4: get the maximum frequency word\n", + "max_freq_word = max(frequency_tokens.keys(), key=(lambda key: frequency_tokens[key]))\n", + "\n", + "\n", + "#step5: finally print the result\n", + "print(f\"Maximum frequency word: {max_freq_word}\") " + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "stop_words_exercise_solutions.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}