diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 86880091..0a046b41 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -23,7 +23,7 @@ The list of topics for which we are looking for content are provided below along
- Interacting with Databases - [Link](https://github.com/animator/learn-python/tree/main/contrib/database)
- Web Scrapping - [Link](https://github.com/animator/learn-python/tree/main/contrib/web-scrapping)
- API Development - [Link](https://github.com/animator/learn-python/tree/main/contrib/api-development)
-- Data Structures & Algorithms - [Link](https://github.com/animator/learn-python/tree/main/contrib/ds-algorithms)
+- Data Structures & Algorithms - [Link](https://github.com/animator/learn-python/tree/main/contrib/ds-algorithms) **(Not accepting)**
- Python Mini Projects - [Link](https://github.com/animator/learn-python/tree/main/contrib/mini-projects) **(Not accepting)**
- Python Question Bank - [Link](https://github.com/animator/learn-python/tree/main/contrib/question-bank) **(Not accepting)**

diff --git a/README.md b/README.md
index a025510c..5cb7379d 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,5 @@
[![Discord Server Invite](https://img.shields.io/badge/DISCORD-JOIN%20SERVER-5663F7?style=for-the-badge&logo=discord&logoColor=white)](https://bit.ly/heyfoss)

-This project is participating in GSSoC 2024.
-
-![gssoc-logo](https://github.com/foss42/awesome-generative-ai-apis/assets/1382619/670b651a-15d7-4869-a4d1-6613df09fa37)
-
Contributors should go through the [Contributing Guide](https://github.com/animator/learn-python/blob/main/CONTRIBUTING.md) to learn how you can contribute to the project.

![Learn Python 3 Logo](images/learn-python.png)

diff --git a/contrib/advanced-python/asynchronous-context-managers-generators.md b/contrib/advanced-python/asynchronous-context-managers-generators.md
new file mode 100644
index 00000000..00516495
--- /dev/null
+++ b/contrib/advanced-python/asynchronous-context-managers-generators.md
@@ -0,0 +1,110 @@
+## Asynchronous Context Managers and Generators in Python
Asynchronous programming in Python allows for more efficient use of resources by enabling tasks to run concurrently. Python provides support for asynchronous context managers and generators, which help manage resources and perform operations asynchronously.

### Asynchronous Context Managers
Asynchronous context managers are similar to regular context managers but are designed to work with asynchronous code. They use the `async with` statement and typically include the `__aenter__` and `__aexit__` methods.

### Creating an Asynchronous Context Manager
Here's a simple example of an asynchronous context manager:

```python
import asyncio

class AsyncContextManager:
    async def __aenter__(self):
        print("Entering context")
        await asyncio.sleep(1)  # Simulate an async operation
        return self

    async def __aexit__(self, exc_type, exc, tb):
        print("Exiting context")
        await asyncio.sleep(1)  # Simulate cleanup

async def main():
    async with AsyncContextManager() as acm:
        print("Inside context")

asyncio.run(main())
```

Output:

```
Entering context
Inside context
Exiting context
```

### Asynchronous Generators
Asynchronous generators allow you to yield values within an asynchronous function. They use the `async def` syntax along with the `yield` statement and are iterated using the `async for` loop.

### Creating an Asynchronous Generator
Here's a basic example of an asynchronous generator:

```python
import asyncio

async def async_generator():
    for i in range(5):
        await asyncio.sleep(1)  # Simulate an async operation
        yield i

async def main():
    async for value in async_generator():
        print(value)

asyncio.run(main())
```
Output:
```
0
1
2
3
4
```
### Combining Asynchronous Context Managers and Generators
You can combine asynchronous context managers and generators to create more complex and efficient asynchronous workflows.

**Example: Fetching Data with an Async Context Manager and Generator**

Consider a scenario where you need to fetch data from an API asynchronously and manage the connection using an asynchronous context manager:
```python
import aiohttp
import asyncio

class AsyncHTTPClient:
    def __init__(self, url):
        self.url = url

    async def __aenter__(self):
        self.session = aiohttp.ClientSession()
        self.response = await self.session.get(self.url)
        return self.response

    async def __aexit__(self, exc_type, exc, tb):
        await self.response.release()
        await self.session.close()

async def async_fetch(urls):
    for url in urls:
        async with AsyncHTTPClient(url) as response:
            data = await response.text()
            yield data

async def main():
    urls = ["http://example.com", "http://example.org", "http://example.net"]
    async for data in async_fetch(urls):
        print(data)

asyncio.run(main())
```
### Benefits of Asynchronous Context Managers and Generators
1. Efficient Resource Management: They help manage resources like network connections or file handles more efficiently by releasing them as soon as they are no longer needed.
2. Concurrency: They enable concurrent operations, improving performance in I/O-bound tasks such as network requests or file I/O.
3. Readability and Maintainability: They provide a clear and structured way to handle asynchronous operations, making the code easier to read and maintain.
### Summary
Asynchronous context managers and generators are powerful tools in Python that enhance the efficiency and readability of asynchronous code. By using `async with` for resource management and `async for` for iteration, you can write more performant and maintainable asynchronous programs.

diff --git a/contrib/advanced-python/index.md b/contrib/advanced-python/index.md
index b093e44e..81d1832e 100644
--- a/contrib/advanced-python/index.md
+++ b/contrib/advanced-python/index.md
@@ -18,3 +18,6 @@
- [Reduce](reduce-function.md)
- [List Comprehension](list-comprehension.md)
- [Eval Function](eval_function.md)
+- [Magic Methods](magic-methods.md)
+- [Asynchronous Context Managers & Generators](asynchronous-context-managers-generators.md)
+- [Threading](threading.md)

diff --git a/contrib/advanced-python/magic-methods.md b/contrib/advanced-python/magic-methods.md
new file mode 100644
index 00000000..447e36b5
--- /dev/null
+++ b/contrib/advanced-python/magic-methods.md
@@ -0,0 +1,151 @@
+# Magic Methods

Magic methods, also known as dunder (double underscore) methods, are special methods in Python that start and end with double underscores (`__`).
These methods allow you to define the behavior of objects for built-in operations and functions, enabling you to customize how your objects interact with the
language's syntax and built-in features. Magic methods make your custom classes integrate seamlessly with Python’s built-in data types and operations.

**Commonly Used Magic Methods**

1. 
**Initialization and Representation**
   - `__init__(self, ...)`: Called when an instance of the class is created. Used for initializing the object's attributes.
   - `__repr__(self)`: Returns a string representation of the object, useful for debugging and logging.
   - `__str__(self)`: Returns a human-readable string representation of the object.

**Example** :

   ```python
   class Person:
       def __init__(self, name, age):
           self.name = name
           self.age = age

       def __repr__(self):
           return f"Person({self.name}, {self.age})"

       def __str__(self):
           return f"{self.name}, {self.age} years old"

   p = Person("Alice", 30)
   print(repr(p))
   print(str(p))
   ```

**Output** :
```
Person(Alice, 30)
Alice, 30 years old
```

2. **Arithmetic Operations**
   - `__add__(self, other)`: Defines behavior for the `+` operator.
   - `__sub__(self, other)`: Defines behavior for the `-` operator.
   - `__mul__(self, other)`: Defines behavior for the `*` operator.
   - `__truediv__(self, other)`: Defines behavior for the `/` operator.

**Example** :

   ```python
   class Vector:
       def __init__(self, x, y):
           self.x = x
           self.y = y

       def __add__(self, other):
           return Vector(self.x + other.x, self.y + other.y)

       def __repr__(self):
           return f"Vector({self.x}, {self.y})"

   v1 = Vector(2, 3)
   v2 = Vector(1, 1)
   v3 = v1 + v2
   print(v3)
   ```

**Output** :

```
Vector(3, 4)
```

3. **Comparison Operations**
   - `__eq__(self, other)`: Defines behavior for the `==` operator.
   - `__lt__(self, other)`: Defines behavior for the `<` operator.
   - `__le__(self, other)`: Defines behavior for the `<=` operator.

**Example** :

   ```python
   class Person:
       def __init__(self, name, age):
           self.name = name
           self.age = age

       def __eq__(self, other):
           return self.age == other.age

       def __lt__(self, other):
           return self.age < other.age

   p1 = Person("Alice", 30)
   p2 = Person("Bob", 25)
   print(p1 == p2)
   print(p1 < p2)
   ```

**Output** :

```
False
False
```

4. **Container and Sequence Methods**

   - `__len__(self)`: Defines behavior for the `len()` function.
   - `__getitem__(self, key)`: Defines behavior for indexing (`self[key]`).
   - `__setitem__(self, key, value)`: Defines behavior for item assignment (`self[key] = value`).
   - `__delitem__(self, key)`: Defines behavior for item deletion (`del self[key]`).

**Example** :

   ```python
   class CustomList:
       def __init__(self, *args):
           self.items = list(args)

       def __len__(self):
           return len(self.items)

       def __getitem__(self, index):
           return self.items[index]

       def __setitem__(self, index, value):
           self.items[index] = value

       def __delitem__(self, index):
           del self.items[index]

       def __repr__(self):
           return f"CustomList({self.items})"

   cl = CustomList(1, 2, 3)
   print(len(cl))
   print(cl[1])
   cl[1] = 5
   print(cl)
   del cl[1]
   print(cl)
   ```

**Output** :
```
3
2
CustomList([1, 5, 3])
CustomList([1, 3])
```

Magic methods provide powerful ways to customize the behavior of your objects and make them work seamlessly with Python's syntax and built-in functions.
Use them judiciously to enhance the functionality and readability of your classes.
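
One convenience worth knowing, shown here as a supplementary sketch rather than part of the sections above: when a class defines `__eq__` together with one ordering method such as `__lt__`, the standard library's `functools.total_ordering` decorator derives the remaining comparison operators automatically.

```python
from functools import total_ordering

@total_ordering
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __eq__(self, other):
        return self.age == other.age

    def __lt__(self, other):
        return self.age < other.age

p1 = Person("Alice", 30)
p2 = Person("Bob", 25)
print(p1 >= p2)  # True - derived from __eq__ and __lt__
print(p1 <= p2)  # False - also derived
```

This keeps the class definition short while still supporting the full set of comparison operators.
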
diff --git a/contrib/advanced-python/regular_expressions.md b/contrib/advanced-python/regular_expressions.md
index 65ff2c2b..81c883ec 100644
--- a/contrib/advanced-python/regular_expressions.md
+++ b/contrib/advanced-python/regular_expressions.md
@@ -1,36 +1,144 @@
## Regular Expressions in Python

-Regular expressions (regex) are a powerful tool for pattern matching and text manipulation.
+Regular expressions (regex) are a powerful tool for pattern matching and text manipulation. Python's re module provides comprehensive support for regular expressions, enabling efficient text processing and validation.
+Regular expressions (regex) are a versatile tool for matching patterns in strings. In Python, the `re` module provides support for working with regular expressions.

## 1. Introduction to Regular Expressions

-A regular expression is a sequence of characters defining a search pattern. Common use cases include validating input, searching within text, and extracting
+A regular expression is a sequence of characters defining a search pattern. Common use cases include validating input, searching within text, and extracting
specific patterns.

## 2. Basic Syntax

Literal Characters: Match exact characters (e.g., abc matches "abc").
-Metacharacters: Special characters like ., *, ?, +, ^, $, [ ], and | used to build patterns.
+Metacharacters: Special characters like ., \*, ?, +, ^, $, [ ], and | used to build patterns.

**Common Metacharacters:**

-* .: Any character except newline.
-* ^: Start of the string.
-* $: End of the string.
-* *: 0 or more repetitions.
-* +: 1 or more repetitions.
-* ?: 0 or 1 repetition.
-* []: Any one character inside brackets (e.g., [a-z]).
-* |: Either the pattern before or after.
-
+- .: Any character except newline.
+- ^: Start of the string.
+- $: End of the string.
+- *: 0 or more repetitions.
+- +: 1 or more repetitions.
+- ?: 0 or 1 repetition.
+- []: Any one character inside brackets (e.g., [a-z]).
+- |: Either the pattern before or after.
+- \ : Drops the special meaning of the character following it.
+- {} : Indicates the number of occurrences of the preceding regex to match.
+- () : Encloses a group of regex patterns.
+
+Examples:
+
+1. `.`
+
+```python
+import re
+pattern = r'c.t'
+text = 'cat cot cut cit'
+matches = re.findall(pattern, text)
+print(matches) # Output: ['cat', 'cot', 'cut', 'cit']
+```
+
+2. `^`
+
+```python
+pattern = r'^Hello'
+text = 'Hello, world!'
+match = re.search(pattern, text)
+print(match.group() if match else 'No match') # Output: 'Hello'
+```
+
+3. `$`
+
+```python
+pattern = r'world!$'
+text = 'Hello, world!'
+match = re.search(pattern, text)
+print(match.group() if match else 'No match') # Output: 'world!'
+```
+
+4. `*`
+
+```python
+pattern = r'ab*'
+text = 'a ab abb abbb'
+matches = re.findall(pattern, text)
+print(matches) # Output: ['a', 'ab', 'abb', 'abbb']
+```
+
+5. `+`
+
+```python
+pattern = r'ab+'
+text = 'a ab abb abbb'
+matches = re.findall(pattern, text)
+print(matches) # Output: ['ab', 'abb', 'abbb']
+```
+
+6. `?`
+
+```python
+pattern = r'ab?'
+text = 'a ab abb abbb'
+matches = re.findall(pattern, text)
+print(matches) # Output: ['a', 'ab', 'ab', 'ab']
+```
+
+7. `[]`
+
+```python
+pattern = r'[aeiou]'
+text = 'hello world'
+matches = re.findall(pattern, text)
+print(matches) # Output: ['e', 'o', 'o']
+```
+
+8. `|`
+
+```python
+pattern = r'cat|dog'
+text = 'I have a cat and a dog.'
+matches = re.findall(pattern, text)
+print(matches) # Output: ['cat', 'dog']
+```
+
+9. `\`
+
+```python
+pattern = r'\$100'
+text = 'The price is $100.'
+match = re.search(pattern, text)
+print(match.group() if match else 'No match') # Output: '$100'
+```
+
+10. `{}`
+
+```python
+pattern = r'\d{3}'
+text = 'My number is 123456'
+matches = re.findall(pattern, text)
+print(matches) # Output: ['123', '456']
+```
+
+11. `()`
+
+```python
+pattern = r'(cat|dog)'
+text = 'I have a cat and a dog.'
+matches = re.findall(pattern, text)
+print(matches) # Output: ['cat', 'dog']
+```
+
## 3. Using the re Module

**Key functions in the re module:**

-* re.match(): Checks for a match at the beginning of the string.
-* re.search(): Searches for a match anywhere in the string.
-* re.findall(): Returns a list of all matches.
-* re.sub(): Replaces matches with a specified string.
+- re.match(): Checks for a match at the beginning of the string.
+- re.search(): Searches for a match anywhere in the string.
+- re.findall(): Returns a list of all matches.
+- re.sub(): Replaces matches with a specified string.
+- re.split(): Returns a list where the string has been split at each match.
+- re.escape(): Escapes special characters.
+
Examples:

-Examples:
```python
import re

@@ -45,12 +153,20 @@ print(re.findall(r'\d+', 'abc123def456')) # Output: ['123', '456']

# Substitute matches
print(re.sub(r'\d+', '#', 'abc123def456')) # Output: abc#def#
+
+# Return a list where the string has been split at each match
+print(re.split(r"\s", "The Donkey in the Town")) # ['The', 'Donkey', 'in', 'the', 'Town']
+
+# Escape special characters
+print(re.escape("We are good to go")) # We\ are\ good\ to\ go
```

## 4. Compiling Regular Expressions
+
Compiling regular expressions improves performance for repeated use.

Example:
+
```python
import re

@@ -58,12 +174,15 @@ pattern = re.compile(r'\d+')
print(pattern.match('123abc').group()) # Output: 123
print(pattern.search('abc123').group()) # Output: 123
print(pattern.findall('abc123def456')) # Output: ['123', '456']
+
```

## 5. Groups and Capturing
+
Parentheses () group and capture parts of the match.

Example:
+
```python
import re

@@ -76,21 +195,46 @@ if match:
```

## 6. Special Sequences
+
Special sequences are shortcuts for common patterns:

-* \d: Any digit.
-* \D: Any non-digit.
-* \w: Any alphanumeric character.
-* \W: Any non-alphanumeric character.
-* \s: Any whitespace character.
-* \S: Any non-whitespace character.
+- \A: Returns a match if the specified characters are at the beginning of the string.
+- \b: Returns a match where the specified characters are at the beginning or at the end of a word.
+- \B: Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word.
+- \d: Any digit.
+- \D: Any non-digit.
+- \w: Any alphanumeric character.
+- \W: Any non-alphanumeric character.
+- \s: Any whitespace character.
+- \S: Any non-whitespace character.
+- \Z: Returns a match if the specified characters are at the end of the string.
+
Example:
+
```python
import re

print(re.search(r'\w+@\w+\.\w+', 'Contact: support@example.com').group()) # Output: support@example.com
```

+## 7. Sets
+
+A set is a group of characters inside a pair of square brackets [] with a special meaning:
+
+- [arn] : Returns a match where one of the specified characters (a, r, or n) is present.
+- [a-n] : Returns a match for any lower case character, alphabetically between a and n.
+- [^arn] : Returns a match for any character EXCEPT a, r, and n.
+- [0123] : Returns a match where any of the specified digits (0, 1, 2, or 3) are present.
+- [0-9] : Returns a match for any digit between 0 and 9.
+- [0-5][0-9] : Returns a match for any two-digit number from 00 to 59.
+- [a-zA-Z] : Returns a match for any character alphabetically between a and z, lower case OR upper case.
+- [+] : In sets, +, \*, ., |, (), $, {} have no special meaning.
+- [+] means: return a match for any + character in the string.
+
## Summary

-Regular expressions are a versatile tool for text processing in Python. The re module offers powerful functions and metacharacters for pattern matching,
-searching, and manipulation, making it an essential skill for handling complex text processing tasks.
+Regular expressions (regex) are a powerful tool for text processing in Python, offering a flexible way to match, search, and manipulate text patterns. The re module provides a comprehensive set of functions and metacharacters to tackle complex text processing tasks.
+With regex, you can:
+1. Match patterns: Use metacharacters like ., \*, ?, and {} to match specific patterns in text.
+2. Search text: Employ functions like re.search() and re.match() to find occurrences of patterns in text.
+3. Manipulate text: Utilize functions like re.sub() to replace patterns with new text.

diff --git a/contrib/advanced-python/threading.md b/contrib/advanced-python/threading.md
new file mode 100644
index 00000000..fa315335
--- /dev/null
+++ b/contrib/advanced-python/threading.md
@@ -0,0 +1,198 @@
+# Threading in Python
A thread is a sequence of instructions in a program that can be executed independently of the rest of the process. Threads are like lightweight processes that share the same memory space but can execute independently. A process is an executable instance of a computer program.
This guide provides an overview of the threading module and its key functionalities.

## Key Characteristics of Threads:
* Shared Memory: All threads within a process share the same memory space, which allows for efficient communication between threads.
* Independent Execution: Each thread can run independently and concurrently.
* Context Switching: The operating system can switch between threads, enabling concurrent execution.

## Threading Module
The threading module allows you to create and manage threads easily. It includes several functions and classes for working with threads.

**1. Creating a Thread:**
To create a thread in Python, you can use the Thread class from the threading module.

Example:
```python
import threading

# Create a thread
thread = threading.Thread()

# Start the thread
thread.start()

# Wait for the thread to complete
thread.join()

print("Thread has finished execution.")
```
Output :
```
Thread has finished execution.
```
**2. Performing a Task with a Thread:**
We can also have a thread perform a specific task by passing a function as the `target` parameter and its arguments as `args` when constructing the Thread object.

Example:

```python
import threading

# Define a function that will be executed by the thread
def print_numbers(arg):
    for i in range(arg):
        print(f"Thread: {i}")
# Create a thread
thread = threading.Thread(target=print_numbers,args=(5,))

# Start the thread
thread.start()

# Wait for the thread to complete
thread.join()

print("Thread has finished execution.")
```
Output :
```
Thread: 0
Thread: 1
Thread: 2
Thread: 3
Thread: 4
Thread has finished execution.
```
**3. Delaying a Task with Thread's Timer Function:**
We can delay the start of a thread using the Timer class, which takes four arguments: `interval`, `function`, `args`, and `kwargs`.

Example:
```python
import threading

# Define a function that will be executed by the thread
def print_numbers(arg):
    for i in range(arg):
        print(f"Thread: {i}")
# Create a thread that starts after 3 seconds
thread = threading.Timer(3, print_numbers, args=(5,))

# Start the thread
thread.start()

# Wait for the thread to complete
thread.join()

print("Thread has finished execution.")
```
Output :
```
# the output is printed after a three-second delay
Thread: 0
Thread: 1
Thread: 2
Thread: 3
Thread: 4
Thread has finished execution.
```
**4. Creating Multiple Threads:**
We can create and manage multiple threads to achieve concurrent execution.

Example:
```python
import threading

def print_numbers(thread_name):
    for i in range(5):
        print(f"{thread_name}: {i}")

# Create multiple threads
thread1 = threading.Thread(target=print_numbers, args=("Thread 1",))
thread2 = threading.Thread(target=print_numbers, args=("Thread 2",))

# Start the threads
thread1.start()
thread2.start()

# Wait for both threads to complete
thread1.join()
thread2.join()

print("Both threads have finished execution.")
```
Output :
```
Thread 1: 0
Thread 1: 1
Thread 2: 0
Thread 1: 2
Thread 1: 3
Thread 2: 1
Thread 2: 2
Thread 2: 3
Thread 2: 4
Thread 1: 4
Both threads have finished execution.
```

**5. Thread Synchronization:**
When we create multiple threads and they access shared resources, there is a risk of race conditions and data corruption. To prevent this, you can use synchronization primitives such as locks.
A lock is a synchronization primitive that ensures that only one thread can access a shared resource at a time.

Example:
```python
import threading

lock = threading.Lock()

def print_numbers(thread_name):
    for i in range(10):
        with lock:
            print(f"{thread_name}: {i}")

# Create multiple threads
thread1 = threading.Thread(target=print_numbers, args=("Thread 1",))
thread2 = threading.Thread(target=print_numbers, args=("Thread 2",))

# Start the threads
thread1.start()
thread2.start()

# Wait for both threads to complete
thread1.join()
thread2.join()

print("Both threads have finished execution.")
```
Output :
```
Thread 1: 0
Thread 1: 1
Thread 1: 2
Thread 1: 3
Thread 1: 4
Thread 1: 5
Thread 1: 6
Thread 1: 7
Thread 1: 8
Thread 1: 9
Thread 2: 0
Thread 2: 1
Thread 2: 2
Thread 2: 3
Thread 2: 4
Thread 2: 5
Thread 2: 6
Thread 2: 7
Thread 2: 8
Thread 2: 9
Both threads have finished execution.
```

A `lock` object is created using `threading.Lock()`, and the `with lock` statement ensures that the lock is acquired before printing and released afterwards. This prevents other threads from reaching the print statement simultaneously.

## Conclusion
Threading in Python is a powerful tool for achieving concurrency and improving the performance of I/O-bound tasks. By understanding and implementing threads using the threading module, you can enhance the efficiency of your programs. To prevent race conditions and maintain data integrity, keep in mind that thread synchronization must be properly managed.

diff --git a/contrib/ds-algorithms/avl-trees.md b/contrib/ds-algorithms/avl-trees.md
new file mode 100644
index 00000000..b87e82cb
--- /dev/null
+++ b/contrib/ds-algorithms/avl-trees.md
@@ -0,0 +1,185 @@
+# AVL Tree

In Data Structures and Algorithms, an **AVL Tree** is a self-balancing binary search tree (BST) where the difference between heights of left and right subtrees cannot be more than one for all nodes. 
It ensures that the tree remains balanced, providing efficient search, insertion, and deletion operations. + +## Points to be Remembered + +- **Balance Factor**: The difference in heights between the left and right subtrees of a node. It should be -1, 0, or +1 for all nodes in an AVL tree. +- **Rotations**: Tree rotations (left, right, left-right, right-left) are used to maintain the balance factor within the allowed range. + +## Real Life Examples of AVL Trees + +- **Databases**: AVL trees can be used to maintain large indexes for database tables, ensuring quick data retrieval. +- **File Systems**: Some file systems use AVL trees to keep track of free and used memory blocks. + +## Applications of AVL Trees + +AVL trees are used in various applications in Computer Science: + +- **Database Indexing** +- **Memory Allocation** +- **Network Routing Algorithms** + +Understanding these applications is essential for Software Development. + +## Operations in AVL Tree + +Key operations include: + +- **INSERT**: Insert a new element into the AVL tree. +- **SEARCH**: Find the position of an element in the AVL tree. +- **DELETE**: Remove an element from the AVL tree. + +## Implementing AVL Tree in Python + +```python +class AVLTreeNode: + def __init__(self, key): + self.key = key + self.left = None + self.right = None + self.height = 1 + +class AVLTree: + def insert(self, root, key): + if not root: + return AVLTreeNode(key) + + if key < root.key: + root.left = self.insert(root.left, key) + else: + root.right = self.insert(root.right, key) + + root.height = 1 + max(self.getHeight(root.left), self.getHeight(root.right)) + balance = self.getBalance(root) + + if balance > 1 and key < root.left.key: + return self.rotateRight(root) + if balance < -1 and key > root.right.key: + return self.rotateLeft(root) + if balance > 1 and key > root.left.key: + root.left = self.rotateLeft(root.left) + return self.rotateRight(root) + if balance < -1 and key < root.right.key: + root.right = self.rotateRight(root.right) + return self.rotateLeft(root) + + return root + + def search(self, root, key): + if not root or root.key == key: + return root + + if key < root.key: + return self.search(root.left, key) + + return self.search(root.right, key) + + def delete(self, root, key): + if not root: + return root + + if key < root.key: + root.left = self.delete(root.left, key) + elif key > root.key: + root.right = self.delete(root.right, key) + else: + if root.left is None: + temp = root.right + root = None + return temp + elif root.right is None: + temp = root.left + root = None + return temp + + temp = self.getMinValueNode(root.right) + root.key = temp.key + root.right = self.delete(root.right, temp.key) + + if root is None: + return root + + root.height = 1 + max(self.getHeight(root.left), self.getHeight(root.right)) + balance = self.getBalance(root) + + if balance > 1 and self.getBalance(root.left) >= 0: + return self.rotateRight(root) + if balance < -1 and self.getBalance(root.right) <= 0: + return self.rotateLeft(root) + if balance > 1 and self.getBalance(root.left) < 0: + root.left = self.rotateLeft(root.left) + return self.rotateRight(root) + if balance < -1 and self.getBalance(root.right) > 0: + root.right = self.rotateRight(root.right) + return self.rotateLeft(root) + + return root + + def rotateLeft(self, z): + y = z.right + T2 = y.left + y.left = z + z.right = T2 + z.height = 1 + max(self.getHeight(z.left), self.getHeight(z.right)) + y.height = 1 + max(self.getHeight(y.left), self.getHeight(y.right)) + return y + + def 
rotateRight(self, z):
        y = z.left
        T3 = y.right
        y.right = z
        z.left = T3
        z.height = 1 + max(self.getHeight(z.left), self.getHeight(z.right))
        y.height = 1 + max(self.getHeight(y.left), self.getHeight(y.right))
        return y

    def getHeight(self, root):
        if not root:
            return 0
        return root.height

    def getBalance(self, root):
        if not root:
            return 0
        return self.getHeight(root.left) - self.getHeight(root.right)

    def getMinValueNode(self, root):
        if root is None or root.left is None:
            return root
        return self.getMinValueNode(root.left)

    def preOrder(self, root):
        if not root:
            return
        print(root.key, end=' ')
        self.preOrder(root.left)
        self.preOrder(root.right)

# Example usage
avl_tree = AVLTree()
root = None

root = avl_tree.insert(root, 10)
root = avl_tree.insert(root, 20)
root = avl_tree.insert(root, 30)
root = avl_tree.insert(root, 40)
root = avl_tree.insert(root, 50)
root = avl_tree.insert(root, 25)

print("Preorder traversal of the AVL tree is:")
avl_tree.preOrder(root)
```

## Output

```
Preorder traversal of the AVL tree is:
30 20 10 25 40 50
```

## Complexity Analysis

- **Insertion**: O(log n). Inserting a node involves traversing the height of the tree, which is logarithmic due to the balancing property.
- **Search**: O(log n). Searching for a node involves traversing the height of the tree.
- **Deletion**: O(log n). Deleting a node involves traversing and potentially rebalancing the tree, maintaining the logarithmic height.
\ No newline at end of file
diff --git a/contrib/ds-algorithms/binary-tree.md b/contrib/ds-algorithms/binary-tree.md
new file mode 100644
index 00000000..03da2cf8
--- /dev/null
+++ b/contrib/ds-algorithms/binary-tree.md
@@ -0,0 +1,231 @@
+# Binary Tree

A binary tree is a non-linear data structure in which each node can have at most two children, known as the left and the right child. It is a hierarchical data structure represented in the following way:

```
        A...................Level 0
       / \
      B   C.................Level 1
     / \   \
    D   E   G...............Level 2
```

## Basic Terminologies

- **Root node:** The topmost node in a tree is the root node. The root node does not have any parent. In the above example, **A** is the root node.
- **Parent node:** The predecessor of a node is called the parent of that node. **A** is the parent of **B** and **C**, **B** is the parent of **D** and **E** and **C** is the parent of **G**.
- **Child node:** The successor of a node is called the child of that node. **B** and **C** are children of **A**, **D** and **E** are children of **B** and **G** is the right child of **C**.
- **Leaf node:** Nodes without any children are called the leaf nodes. **D**, **E** and **G** are the leaf nodes.
- **Ancestor node:** Predecessor nodes on the path from the root to that node are called ancestor nodes. **A** and **B** are the ancestors of **E**.
- **Descendant node:** Successor nodes on the path from the root to that node are called descendant nodes. **B** and **E** are descendants of **A**.
- **Sibling node:** Nodes having the same parent are called sibling nodes. **B** and **C** are sibling nodes and so are **D** and **E**.
- **Level (Depth) of a node:** Number of edges in the path from the root to that node is the level of that node. The root node is always at level 0. The maximum depth among all nodes is the depth of the tree.
- **Height of a node:** Number of edges in the path from that node to the deepest leaf is the height of that node. 
The height of the root is the height of the tree. The height of node **A** is 2, of nodes **B** and **C** is 1, and of nodes **D**, **E** and **G** is 0.

## Types Of Binary Trees

- **Full Binary Tree:** A binary tree where each node has 0 or 2 children is a full binary tree.
```
      A
     / \
    B   C
   / \
  D   E
```
- **Complete Binary Tree:** A binary tree in which all levels are completely filled except the last level is a complete binary tree. Whenever new nodes are inserted, they are inserted from the left side.
```
       A
      / \
     /   \
    B     C
   / \   /
  D   E F
```
- **Perfect Binary Tree:** A binary tree in which all nodes are completely filled, i.e., each node has two children is called a perfect binary tree.
```
       A
      / \
     /   \
    B     C
   / \   / \
  D   E F   G
```
- **Skewed Binary Tree:** A binary tree in which each node has either 0 or 1 child is called a skewed binary tree. It is of two types - left skewed binary tree and right skewed binary tree.
```
  A                        A
   \                      /
    B                    B
     \                  /
      C                C
  Right skewed         Left skewed
  binary tree          binary tree
```
- **Balanced Binary Tree:** A binary tree in which the height difference between the left and right subtree is not more than one and the subtrees are also balanced is a balanced binary tree.
```
      A
     / \
    B   C
   / \
  D   E
```

## Real Life Applications Of Binary Tree

- **File Systems:** File systems employ binary trees to organize the folders and files, facilitating efficient search and access of files.
- **Decision Trees:** Decision tree, a supervised learning algorithm, utilizes binary trees, with each node representing a decision and its edges showing the possible outcomes.
- **Routing Algorithms:** In routing algorithms, binary trees are used to efficiently transfer data packets from the source to destination through a network of nodes.
- **Searching and Sorting Algorithms:** Searching algorithms like binary search and sorting algorithms like heapsort heavily rely on binary trees.
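
The height and balance definitions from the terminology section map directly to code. Here is a small illustrative sketch (the helper names are ours, not from this article) that computes the height of a node and checks whether a tree is balanced; the implementation section below builds the full structure:

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None

def height(node):
    # An empty subtree has height -1, so a leaf node gets height 0
    if node is None:
        return -1
    return 1 + max(height(node.left), height(node.right))

def is_balanced(node):
    # Balanced: every node's subtrees differ in height by at most one
    if node is None:
        return True
    if abs(height(node.left) - height(node.right)) > 1:
        return False
    return is_balanced(node.left) and is_balanced(node.right)

# The balanced tree from the example above: A with children B and C, B with D and E
root = Node('A')
root.left, root.right = Node('B'), Node('C')
root.left.left, root.left.right = Node('D'), Node('E')
print(height(root))       # 2
print(is_balanced(root))  # True
```
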
+ +## Implementation of Binary Tree + +```python +from collections import deque + +class Node: + def __init__(self, data): + self.data = data + self.left = None + self.right = None + +class Binary_tree: + @staticmethod + def insert(root, data): + if root is None: + return Node(data) + q = deque() + q.append(root) + while q: + temp = q.popleft() + if temp.left is None: + temp.left = Node(data) + break + else: + q.append(temp.left) + if temp.right is None: + temp.right = Node(data) + break + else: + q.append(temp.right) + return root + + @staticmethod + def inorder(root): + if not root: + return + b.inorder(root.left) + print(root.data, end=" ") + b.inorder(root.right) + + @staticmethod + def preorder(root): + if not root: + return + print(root.data, end=" ") + b.preorder(root.left) + b.preorder(root.right) + + @staticmethod + def postorder(root): + if not root: + return + b.postorder(root.left) + b.postorder(root.right) + print(root.data, end=" ") + + @staticmethod + def levelorder(root): + if not root: + return + q = deque() + q.append(root) + while q: + temp = q.popleft() + print(temp.data, end=" ") + if temp.left is not None: + q.append(temp.left) + if temp.right is not None: + q.append(temp.right) + + @staticmethod + def delete(root, value): + q = deque() + q.append(root) + while q: + temp = q.popleft() + if temp is value: + temp = None + return + if temp.right: + if temp.right is value: + temp.right = None + return + else: + q.append(temp.right) + if temp.left: + if temp.left is value: + temp.left = None + return + else: + q.append(temp.left) + + @staticmethod + def delete_value(root, value): + if root is None: + return None + if root.left is None and root.right is None: + if root.data == value: + return None + else: + return root + x = None + q = deque() + q.append(root) + temp = None + while q: + temp = q.popleft() + if temp.data == value: + x = temp + if temp.left: + q.append(temp.left) + if temp.right: + q.append(temp.right) + if x: + y = temp.data + x.data = y + b.delete(root, temp) + return root + +b = Binary_tree() +root = None +root = b.insert(root, 10) +root = b.insert(root, 20) +root = b.insert(root, 30) +root = b.insert(root, 40) +root = b.insert(root, 50) +root = b.insert(root, 60) + +print("Preorder traversal:", end=" ") +b.preorder(root) + +print("\nInorder traversal:", end=" ") +b.inorder(root) + +print("\nPostorder traversal:", end=" ") +b.postorder(root) + +print("\nLevel order traversal:", end=" ") +b.levelorder(root) + +root = b.delete_value(root, 20) +print("\nLevel order traversal after deletion:", end=" ") +b.levelorder(root) +``` + +#### OUTPUT + +``` +Preorder traversal: 10 20 40 50 30 60 +Inorder traversal: 40 20 50 10 60 30 +Postorder traversal: 40 50 20 60 30 10 +Level order traversal: 10 20 30 40 50 60 +Level order traversal after deletion: 10 60 30 40 50 +``` diff --git a/contrib/ds-algorithms/deque.md b/contrib/ds-algorithms/deque.md new file mode 100644 index 00000000..2a5a77d2 --- /dev/null +++ b/contrib/ds-algorithms/deque.md @@ -0,0 +1,216 @@ +# Deque in Python + +## Definition +A deque, short for double-ended queue, is an ordered collection of items that allows rapid insertion and deletion at both ends. + +## Syntax +In Python, deques are implemented in the collections module: + +```py +from collections import deque + +# Creating a deque +d = deque(iterable) # Create deque from iterable (optional) +``` + +## Operations +1. **Appending Elements**: + + - append(x): Adds element x to the right end of the deque. 
+ - appendleft(x): Adds element x to the left end of the deque. + + ### Program + ```py + from collections import deque + + # Initialize a deque + d = deque([1, 2, 3, 4, 5]) + print("Initial deque:", d) + + # Append elements + d.append(6) + print("After append(6):", d) + + # Append left + d.appendleft(0) + print("After appendleft(0):", d) + + ``` + ### Output + ```py + Initial deque: deque([1, 2, 3, 4, 5]) + After append(6): deque([1, 2, 3, 4, 5, 6]) + After appendleft(0): deque([0, 1, 2, 3, 4, 5, 6]) + ``` + +2. **Removing Elements**: + + - pop(): Removes and returns the rightmost element. + - popleft(): Removes and returns the leftmost element. + + ### Program + ```py + from collections import deque + + # Initialize a deque + d = deque([1, 2, 3, 4, 5]) + print("Initial deque:", d) + + # Pop from the right end + rightmost = d.pop() + print("Popped from right end:", rightmost) + print("Deque after pop():", d) + + # Pop from the left end + leftmost = d.popleft() + print("Popped from left end:", leftmost) + print("Deque after popleft():", d) + + ``` + + ### Output + ```py + Initial deque: deque([1, 2, 3, 4, 5]) + Popped from right end: 5 + Deque after pop(): deque([1, 2, 3, 4]) + Popped from left end: 1 + Deque after popleft(): deque([2, 3, 4]) + ``` + +3. **Accessing Elements**: + + - deque[index]: Accesses element at index. + + ### Program + ```py + from collections import deque + + # Initialize a deque + d = deque([1, 2, 3, 4, 5]) + print("Initial deque:", d) + + # Accessing elements + print("Element at index 2:", d[2]) + + ``` + + ### Output + ```py + Initial deque: deque([1, 2, 3, 4, 5]) + Element at index 2: 3 + + ``` + +4. **Other Operations**: + + - extend(iterable): Extends deque by appending elements from iterable. + - extendleft(iterable): Extends deque by appending elements from iterable to the left. + - rotate(n): Rotates deque n steps to the right (negative n rotates left). + + ### Program + ```py + from collections import deque + + # Initialize a deque + d = deque([1, 2, 3, 4, 5]) + print("Initial deque:", d) + + # Extend deque + d.extend([6, 7, 8]) + print("After extend([6, 7, 8]):", d) + + # Extend left + d.extendleft([-1, 0]) + print("After extendleft([-1, 0]):", d) + + # Rotate deque + d.rotate(2) + print("After rotate(2):", d) + + # Rotate left + d.rotate(-3) + print("After rotate(-3):", d) + + ``` + + ### Output + ```py + Initial deque: deque([1, 2, 3, 4, 5]) + After extend([6, 7, 8]): deque([1, 2, 3, 4, 5, 6, 7, 8]) + After extendleft([-1, 0]): deque([0, -1, 1, 2, 3, 4, 5, 6, 7, 8]) + After rotate(2): deque([7, 8, 0, -1, 1, 2, 3, 4, 5, 6]) + After rotate(-3): deque([1, 2, 3, 4, 5, 6, 7, 8, 0, -1]) + + ``` + + +## Example + +### 1. 
Finding Maximum in Sliding Window
```py
from collections import deque

def max_sliding_window(nums, k):
    if not nums:
        return []

    d = deque()
    result = []

    for i, num in enumerate(nums):
        # Remove elements from deque that are out of the current window
        if d and d[0] <= i - k:
            d.popleft()

        # Remove elements from deque smaller than the current element
        while d and nums[d[-1]] <= num:
            d.pop()

        d.append(i)

        # Add maximum for current window
        if i >= k - 1:
            result.append(nums[d[0]])

    return result

# Example usage:
nums = [1, 3, -1, -3, 5, 3, 6, 7]
k = 3
print("Maximums in sliding window of size", k, "are:", max_sliding_window(nums, k))

```

Output
```py
Maximums in sliding window of size 3 are: [3, 3, 5, 5, 6, 7]
```


## Applications
- **Efficient Queues and Stacks**: Deques allow fast O(1) append and pop operations from both ends, making them ideal for implementing queues and stacks.
- **Sliding Window Maximum/Minimum**: Used in algorithms that require efficient windowed computations.


## Advantages
- Efficiency: O(1) time complexity for append and pop operations from both ends.
- Versatility: Can function both as a queue and as a stack.
- Flexible: Supports rotation and indexed access efficiently (note that deques do not support slicing).


## Disadvantages
- Memory Usage: Requires more memory compared to simple lists due to overhead in managing linked nodes.

## Conclusion
- Deques in Python, provided by the `deque` class in the collections module, offer efficient double-ended queue operations with O(1) time complexity for append and pop operations on both ends. They are versatile data structures suitable for implementing queues, stacks, and more complex algorithms requiring efficient manipulation of elements at both ends.

- While deques excel in scenarios requiring fast append and pop operations from either end, they do consume more memory compared to simple lists due to their implementation using doubly-linked lists. However, their flexibility and efficiency make them invaluable for various programming tasks and algorithmic solutions.
\ No newline at end of file
diff --git a/contrib/ds-algorithms/dijkstra.md b/contrib/ds-algorithms/dijkstra.md
new file mode 100644
index 00000000..cea6da40
--- /dev/null
+++ b/contrib/ds-algorithms/dijkstra.md
@@ -0,0 +1,90 @@

# Dijkstra's Algorithm
Dijkstra's algorithm is a graph algorithm that gives the shortest distance from a given source node to every other node in a weighted graph, provided all edge weights are non-negative (the example below uses an undirected graph). It operates by continually choosing the closest unvisited node and determining the distance to all its unvisited neighboring nodes. This algorithm is similar to BFS in graphs, with the difference being that it gives priority to nodes with shorter distances by using a priority queue (min-heap) instead of a FIFO queue. The data structures required are a distance list (to store the minimum distance of each node) and a priority queue or a set; we assume the adjacency list will be provided.

## Working
- We will store the minimum distance of each node in the distance list, which has a length equal to the number of nodes in the graph. Thus, the minimum distance of the 2nd node will be stored at the 2nd index of the distance list. We initialize the list with the largest value possible, say infinity.

- We now start the traversal from the starting node given and mark its distance as 0. We push this node to the priority queue along with its minimum distance, which is 0, so the structure pushed will be (0, node), a tuple. 

- Now, with the help of the adjacency list, we will add the neighboring nodes to the priority queue with the distance equal to (edge weight + current node distance), provided this is less than the current value in the distance list. We will also update the distance list in the process.

- When all the nodes are added, we will select the node with the shortest distance and repeat the process.

## Dry Run
We will now do a manual simulation using the example graph given. First, (0, a) is pushed to the priority queue (pq).
![Photo 1](images/Dijkstra's_algorithm_photo1.png)

- **Step1:** The lowest element is popped from the pq, which is (0, a), and all its neighboring nodes are added to the pq while simultaneously checking the distance list. Thus (3, b), (7, c), (1, d) are added to the pq.
![Photo 2](images/Dijkstra's_algorithm_photo2.png)

- **Step2:** Again, the lowest element is popped from the pq, which is (1, d). It has two neighboring nodes, a and e, from which (1 + 1, a) will not be added to the pq as dist[a] = 0 is less than 2.
![Photo 3](images/Dijkstra's_algorithm_photo3.png)

- **Step3:** Now, the lowest element is popped from the pq, which is (3, b). It has two neighboring nodes, a and c, from which (3 + 3, a) will not be added to the pq since dist[a] = 0 is smaller. But the new distance to reach c is 5 (3 + 2), which is less than dist[c] = 7. So (5, c) is added to the pq.
![Photo 4](images/Dijkstra's_algorithm_photo4.png)

- **Step4:** The next smallest element is (5, c). It has neighbors a and e. The new distance to reach a will be 5 + 7 = 12, which is more than dist[a], so it will not be considered. Similarly, the new distance for e is 5 + 3 = 8, which again will not be considered. So, no new tuple has been added to the pq.
![Photo 5](images/Dijkstra's_algorithm_photo5.png)

- **Step5:** Similarly, both the elements of the pq will be popped one by one without any new addition.
![Photo 6](images/Dijkstra's_algorithm_photo6.png)
![Photo 7](images/Dijkstra's_algorithm_photo7.png)

- The distance list we get at the end will be our answer.
- `Output` `dist = [0, 3, 5, 1, 6]`

## Python Code
```python
import heapq

def dijkstra(graph, start):
    # Create a priority queue
    pq = []
    heapq.heappush(pq, (0, start))

    # Create a dictionary to store distances to each node
    dist = {node: float('inf') for node in graph}
    dist[start] = 0

    while pq:
        # Get the node with the smallest distance
        current_distance, current_node = heapq.heappop(pq)

        # If the current distance is greater than the recorded distance, skip it
        if current_distance > dist[current_node]:
            continue

        # Update the distances to the neighboring nodes
        for neighbor, weight in graph[current_node].items():
            distance = current_distance + weight
            # Only consider this new path if it's better
            if distance < dist[neighbor]:
                dist[neighbor] = distance
                heapq.heappush(pq, (distance, neighbor))

    return dist

# Example usage:
graph = {
    'A': {'B': 1, 'C': 4},
    'B': {'A': 1, 'C': 2, 'D': 5},
    'C': {'A': 4, 'B': 2, 'D': 1},
    'D': {'B': 5, 'C': 1}
}

start_node = 'A'
dist = dijkstra(graph, start_node)
print(dist)
```

## Complexity Analysis

- **Time Complexity**: O((V + E) log V)
- **Space Complexity**: O(V + E)

diff --git a/contrib/ds-algorithms/dynamic-programming.md b/contrib/ds-algorithms/dynamic-programming.md
index 43149f86..f4958689 100644
--- a/contrib/ds-algorithms/dynamic-programming.md
+++ b/contrib/ds-algorithms/dynamic-programming.md
@@ -51,10 +51,6 @@ print(f"The {n}th Fibonacci number is: {fibonacci(n)}.")

- **Time Complexity**: O(n) for both approaches
- **Space Complexity**: O(n) for the top-down approach (due to memoization), O(1) for the bottom-up approach
-
-
-
- # 2. Longest Common Subsequence The longest common subsequence (LCS) problem is to find the longest subsequence common to two sequences. A subsequence is a sequence that appears in the same relative order but not necessarily contiguous. @@ -84,14 +80,34 @@ Y = "GXTXAYB" print("Length of Longest Common Subsequence:", longest_common_subsequence(X, Y, len(X), len(Y))) ``` +## Longest Common Subsequence Code in Python (Bottom-Up Approach) + +```python + +def longestCommonSubsequence(X, Y, m, n): + L = [[None]*(n+1) for i in range(m+1)] + for i in range(m+1): + for j in range(n+1): + if i == 0 or j == 0: + L[i][j] = 0 + elif X[i-1] == Y[j-1]: + L[i][j] = L[i-1][j-1]+1 + else: + L[i][j] = max(L[i-1][j], L[i][j-1]) + return L[m][n] + + +S1 = "AGGTAB" +S2 = "GXTXAYB" +m = len(S1) +n = len(S2) +print("Length of LCS is", longestCommonSubsequence(S1, S2, m, n)) +``` + ## Complexity Analysis -- **Time Complexity**: O(m * n) for the top-down approach, where m and n are the lengths of the input sequences +- **Time Complexity**: O(m * n) for both approaches, where m and n are the lengths of the input sequences - **Space Complexity**: O(m * n) for the memoization table -
-
-
- # 3. 0-1 Knapsack Problem The 0-1 knapsack problem is a classic optimization problem where the goal is to maximize the total value of items selected while keeping the total weight within a specified limit. @@ -123,10 +139,315 @@ n = len(weights) print("Maximum value that can be obtained:", knapsack(weights, values, capacity, n)) ``` +## 0-1 Knapsack Problem Code in Python (Bottom-up Approach) + +```python +def knapSack(capacity, weights, values, n): + K = [[0 for x in range(capacity + 1)] for x in range(n + 1)] + for i in range(n + 1): + for w in range(capacity + 1): + if i == 0 or w == 0: + K[i][w] = 0 + elif weights[i-1] <= w: + K[i][w] = max(values[i-1] + + K[i-1][w-weights[i-1]], + K[i-1][w]) + else: + K[i][w] = K[i-1][w] + + return K[n][capacity] + +values = [60, 100, 120] +weights = [10, 20, 30] +capacity = 50 +n = len(weights) +print(knapSack(capacity, weights, values, n)) +``` + ## Complexity Analysis -- **Time Complexity**: O(n * W) for the top-down approach, where n is the number of items and W is the capacity of the knapsack +- **Time Complexity**: O(n * W) for both approaches, where n is the number of items and W is the capacity of the knapsack - **Space Complexity**: O(n * W) for the memoization table -
-
-
\ No newline at end of file +# 4. Longest Increasing Subsequence + +The Longest Increasing Subsequence (LIS) is a task is to find the longest subsequence that is strictly increasing, meaning each element in the subsequence is greater than the one before it. This subsequence must maintain the order of elements as they appear in the original sequence but does not need to be contiguous. The goal is to identify the subsequence with the maximum possible length. + +**Algorithm Overview:** +- **Base cases:** If the sequence is empty, the LIS length is 0. +- **Memoization:** Store the results of previously computed subproblems to avoid redundant computations. +- **Recurrence relation:** Compute the LIS length by comparing characters of the sequences and making decisions based on their values. + +## Longest Increasing Subsequence Code in Python (Top-Down Approach using Memoization) + +```python +import sys + +def f(idx, prev_idx, n, a, dp): + if (idx == n): + return 0 + + if (dp[idx][prev_idx + 1] != -1): + return dp[idx][prev_idx + 1] + + notTake = 0 + f(idx + 1, prev_idx, n, a, dp) + take = -sys.maxsize - 1 + if (prev_idx == -1 or a[idx] > a[prev_idx]): + take = 1 + f(idx + 1, idx, n, a, dp) + + dp[idx][prev_idx + 1] = max(take, notTake) + return dp[idx][prev_idx + 1] + +def longestSubsequence(n, a): + + dp = [[-1 for i in range(n + 1)]for j in range(n + 1)] + return f(0, -1, n, a, dp) + +a = [3, 10, 2, 1, 20] +n = len(a) + +print("Length of lis is", longestSubsequence(n, a)) + +``` + +## Longest Increasing Subsequence Code in Python (Bottom-Up Approach) + +```python +def lis(arr): + n = len(arr) + lis = [1]*n + + for i in range(1, n): + for j in range(0, i): + if arr[i] > arr[j] and lis[i] < lis[j] + 1: + lis[i] = lis[j]+1 + + maximum = 0 + for i in range(n): + maximum = max(maximum, lis[i]) + + return maximum + +arr = [10, 22, 9, 33, 21, 50, 41, 60] +print("Length of lis is", lis(arr)) +``` + +## Complexity Analysis +- **Time Complexity**: O(n * n) for both approaches, where n is the length of the array. +- **Space Complexity**: O(n * n) for the memoization table in Top-Down Approach, O(n) in Bottom-Up Approach. + +# 5. String Edit Distance + +The String Edit Distance algorithm calculates the minimum number of operations (insertions, deletions, or substitutions) required to convert one string into another. + +**Algorithm Overview:** +- **Base Cases:** If one string is empty, the edit distance is the length of the other string. +- **Memoization:** Store the results of previously computed edit distances to avoid redundant computations. +- **Recurrence Relation:** Compute the edit distance by considering insertion, deletion, and substitution operations. + +## String Edit Distance Code in Python (Top-Down Approach with Memoization) +```python +def edit_distance(str1, str2, memo={}): + m, n = len(str1), len(str2) + if (m, n) in memo: + return memo[(m, n)] + if m == 0: + return n + if n == 0: + return m + if str1[m - 1] == str2[n - 1]: + memo[(m, n)] = edit_distance(str1[:m-1], str2[:n-1], memo) + else: + memo[(m, n)] = 1 + min(edit_distance(str1, str2[:n-1], memo), # Insert + edit_distance(str1[:m-1], str2, memo), # Remove + edit_distance(str1[:m-1], str2[:n-1], memo)) # Replace + return memo[(m, n)] + +str1 = "sunday" +str2 = "saturday" +print(f"Edit Distance between '{str1}' and '{str2}' is {edit_distance(str1, str2)}.") +``` + +#### Output +``` +Edit Distance between 'sunday' and 'saturday' is 3. 
+``` + +## String Edit Distance Code in Python (Bottom-Up Approach) +```python +def edit_distance(str1, str2): + m, n = len(str1), len(str2) + dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)] + + for i in range(m + 1): + for j in range(n + 1): + if i == 0: + dp[i][j] = j + elif j == 0: + dp[i][j] = i + elif str1[i - 1] == str2[j - 1]: + dp[i][j] = dp[i - 1][j - 1] + else: + dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) + + return dp[m][n] + +str1 = "sunday" +str2 = "saturday" +print(f"Edit Distance between '{str1}' and '{str2}' is {edit_distance(str1, str2)}.") +``` + +#### Output +``` +Edit Distance between 'sunday' and 'saturday' is 3. +``` + +## **Complexity Analysis:** +- **Time Complexity:** O(m * n) where m and n are the lengths of string 1 and string 2 respectively +- **Space Complexity:** O(m * n) for both top-down and bottom-up approaches + + +# 6. Matrix Chain Multiplication + +The Matrix Chain Multiplication finds the optimal way to multiply a sequence of matrices to minimize the number of scalar multiplications. + +**Algorithm Overview:** +- **Base Cases:** The cost of multiplying one matrix is zero. +- **Memoization:** Store the results of previously computed matrix chain orders to avoid redundant computations. +- **Recurrence Relation:** Compute the optimal cost by splitting the product at different points and choosing the minimum cost. + +## Matrix Chain Multiplication Code in Python (Top-Down Approach with Memoization) +```python +def matrix_chain_order(p, memo={}): + n = len(p) - 1 + def compute_cost(i, j): + if (i, j) in memo: + return memo[(i, j)] + if i == j: + return 0 + memo[(i, j)] = float('inf') + for k in range(i, j): + q = compute_cost(i, k) + compute_cost(k + 1, j) + p[i - 1] * p[k] * p[j] + if q < memo[(i, j)]: + memo[(i, j)] = q + return memo[(i, j)] + return compute_cost(1, n) + +p = [1, 2, 3, 4] +print(f"Minimum number of multiplications is {matrix_chain_order(p)}.") +``` + +#### Output +``` +Minimum number of multiplications is 18. +``` + + +## Matrix Chain Multiplication Code in Python (Bottom-Up Approach) +```python +def matrix_chain_order(p): + n = len(p) - 1 + m = [[0 for _ in range(n)] for _ in range(n)] + + for L in range(2, n + 1): + for i in range(n - L + 1): + j = i + L - 1 + m[i][j] = float('inf') + for k in range(i, j): + q = m[i][k] + m[k + 1][j] + p[i] * p[k + 1] * p[j + 1] + if q < m[i][j]: + m[i][j] = q + + return m[0][n - 1] + +p = [1, 2, 3, 4] +print(f"Minimum number of multiplications is {matrix_chain_order(p)}.") +``` + +#### Output +``` +Minimum number of multiplications is 18. +``` + +## **Complexity Analysis:** +- **Time Complexity:** O(n^3) where n is the number of matrices in the chain. For an `array p` of dimensions representing the matrices such that the `i-th matrix` has dimensions `p[i-1] x p[i]`, n is `len(p) - 1` +- **Space Complexity:** O(n^2) for both top-down and bottom-up approaches + +# 7. Optimal Binary Search Tree + +The Matrix Chain Multiplication finds the optimal way to multiply a sequence of matrices to minimize the number of scalar multiplications. + +**Algorithm Overview:** +- **Base Cases:** The cost of a single key is its frequency. +- **Memoization:** Store the results of previously computed subproblems to avoid redundant computations. +- **Recurrence Relation:** Compute the optimal cost by trying each key as the root and choosing the minimum cost. 

## Optimal Binary Search Tree Code in Python (Top-Down Approach with Memoization)

```python
def optimal_bst(keys, freq, memo=None):
    # Create a fresh memo per top-level call to avoid the mutable-default-argument pitfall
    if memo is None:
        memo = {}
    n = len(keys)
    def compute_cost(i, j):
        if (i, j) in memo:
            return memo[(i, j)]
        if i > j:
            return 0
        if i == j:
            return freq[i]
        memo[(i, j)] = float('inf')
        total_freq = sum(freq[i:j+1])
        for r in range(i, j + 1):
            cost = (compute_cost(i, r - 1) +
                    compute_cost(r + 1, j) +
                    total_freq)
            if cost < memo[(i, j)]:
                memo[(i, j)] = cost
        return memo[(i, j)]
    return compute_cost(0, n - 1)

keys = [10, 12, 20]
freq = [34, 8, 50]
print(f"Cost of Optimal BST is {optimal_bst(keys, freq)}.")
```

#### Output
```
Cost of Optimal BST is 142.
```

## Optimal Binary Search Tree Code in Python (Bottom-Up Approach)

```python
def optimal_bst(keys, freq):
    n = len(keys)
    cost = [[0 for x in range(n)] for y in range(n)]

    for i in range(n):
        cost[i][i] = freq[i]

    for L in range(2, n + 1):
        for i in range(n - L + 1):
            j = i + L - 1
            cost[i][j] = float('inf')
            total_freq = sum(freq[i:j+1])
            for r in range(i, j + 1):
                c = (cost[i][r - 1] if r > i else 0) + \
                    (cost[r + 1][j] if r < j else 0) + \
                    total_freq
                if c < cost[i][j]:
                    cost[i][j] = c

    return cost[0][n - 1]

keys = [10, 12, 20]
freq = [34, 8, 50]
print(f"Cost of Optimal BST is {optimal_bst(keys, freq)}.")
```

#### Output
```
Cost of Optimal BST is 142.
```

### Complexity Analysis
- **Time Complexity**: O(n^3) where n is the number of keys in the binary search tree.
- **Space Complexity**: O(n^2) for both top-down and bottom-up approaches

diff --git a/contrib/ds-algorithms/hash-tables.md b/contrib/ds-algorithms/hash-tables.md
new file mode 100644
index 00000000..f03b7c5e
--- /dev/null
+++ b/contrib/ds-algorithms/hash-tables.md
@@ -0,0 +1,212 @@
+# Data Structures: Hash Tables, Hash Sets, and Hash Maps

## Table of Contents
- [Introduction](#introduction)
- [Hash Tables](#hash-tables)
  - [Overview](#overview)
  - [Operations](#operations)
- [Hash Sets](#hash-sets)
  - [Overview](#overview-1)
  - [Operations](#operations-1)
- [Hash Maps](#hash-maps)
  - [Overview](#overview-2)
  - [Operations](#operations-2)
- [Conclusion](#conclusion)

## Introduction
This document provides an overview of three fundamental data structures in computer science: hash tables, hash sets, and hash maps. These structures are widely used for efficient data storage and retrieval operations.

## Hash Tables

### Overview
A **hash table** is a data structure that stores key-value pairs. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

### Operations
1. **Insertion**: Add a new key-value pair to the hash table.
2. **Deletion**: Remove a key-value pair from the hash table.
3. **Search**: Find the value associated with a given key.
4. 

The example below resolves collisions by separate chaining: each bucket holds a linked list of nodes, so keys that hash to the same index can coexist.

**Example Code (Python):**
```python
class Node:
    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None


class HashTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.size = 0
        self.table = [None] * capacity

    def _hash(self, key):
        return hash(key) % self.capacity

    def insert(self, key, value):
        index = self._hash(key)

        if self.table[index] is None:
            self.table[index] = Node(key, value)
            self.size += 1
        else:
            current = self.table[index]
            while current:
                if current.key == key:
                    current.value = value  # key already present: update in place
                    return
                current = current.next
            new_node = Node(key, value)
            new_node.next = self.table[index]  # prepend to the chain
            self.table[index] = new_node
            self.size += 1

    def search(self, key):
        index = self._hash(key)

        current = self.table[index]
        while current:
            if current.key == key:
                return current.value
            current = current.next

        raise KeyError(key)

    def remove(self, key):
        index = self._hash(key)

        previous = None
        current = self.table[index]

        while current:
            if current.key == key:
                if previous:
                    previous.next = current.next
                else:
                    self.table[index] = current.next
                self.size -= 1
                return
            previous = current
            current = current.next

        raise KeyError(key)

    def __len__(self):
        return self.size

    def __contains__(self, key):
        try:
            self.search(key)
            return True
        except KeyError:
            return False


# Driver code
if __name__ == '__main__':

    ht = HashTable(5)

    ht.insert("apple", 3)
    ht.insert("banana", 2)
    ht.insert("cherry", 5)

    print("apple" in ht)   # True
    print("durian" in ht)  # False

    print(ht.search("banana"))  # 2

    ht.insert("banana", 4)
    print(ht.search("banana"))  # 4

    ht.remove("apple")

    print(len(ht))  # 2
```

## Hash Sets

### Overview
A **hash set** is a collection of unique elements. It is implemented using a hash table where each bucket can store only one element.

### Operations
1. **Insertion**: Add a new element to the set.
2. **Deletion**: Remove an element from the set.
3. **Search**: Check if an element exists in the set.
4. **Union**: Combine two sets to form a new set with elements from both.
5. **Intersection**: Find common elements between two sets.
6. **Difference**: Find elements present in one set but not in the other.

**Example Code (Python):**
```python
# Create a hash set
hash_set = set()

# Insert elements
hash_set.add("element1")
hash_set.add("element2")

# Search for an element
exists = "element1" in hash_set

# Delete an element
hash_set.remove("element2")

# Union of sets
another_set = {"element3", "element4"}
union_set = hash_set.union(another_set)

# Intersection of sets
intersection_set = hash_set.intersection(another_set)

# Difference of sets
difference_set = hash_set.difference(another_set)
```
## Hash Maps

### Overview
A **hash map** is similar to a hash table but often provides additional functionalities and more user-friendly interfaces for developers. It is a collection of key-value pairs where each key is unique.

### Operations
1. **Insertion**: Add a new key-value pair to the hash map.
2. **Deletion**: Remove a key-value pair from the hash map.
3. **Search**: Retrieve the value associated with a given key.
4. **Update**: Change the value associated with a given key.

**Example Code (Python):**
```python
# Create a hash map
hash_map = {}

# Insert elements
hash_map["key1"] = "value1"
hash_map["key2"] = "value2"

# Search for an element
value = hash_map.get("key1")

# Delete an element
del hash_map["key2"]

# Update an element
hash_map["key1"] = "new_value1"

```
## Conclusion
Hash tables, hash sets, and hash maps are powerful data structures that provide efficient means of storing and retrieving data. Understanding these structures and their operations is crucial for developing optimized algorithms and applications.
\ No newline at end of file
diff --git a/contrib/ds-algorithms/heaps.md b/contrib/ds-algorithms/heaps.md
new file mode 100644
index 00000000..6a9ba71a
--- /dev/null
+++ b/contrib/ds-algorithms/heaps.md
@@ -0,0 +1,169 @@
# Heaps

## Definition:
Heaps are a crucial data structure that support efficient priority queue operations. They come in two main types: min heaps and max heaps. Python's heapq module provides a robust implementation for min heaps, and with some minor adjustments, it can also be used to implement max heaps.

## Overview:
A heap is a specialized binary tree-based data structure that satisfies the heap property:

- **Min Heap:** The key at the root must be the minimum among all keys present in the binary heap. This property must be recursively true for all nodes in the binary tree.

- **Max Heap:** The key at the root must be the maximum among all keys present in the binary heap. This property must be recursively true for all nodes in the binary tree.

## Python heapq Module:
The heapq module provides an implementation of the heap queue algorithm, also known as the priority queue algorithm.

- **Min Heap:** In a min heap, the smallest element is always at the root. Here's how to use heapq to create and manipulate a min heap:

```python
import heapq

# Create an empty heap
min_heap = []

# Adding elements to the heap
heapq.heappush(min_heap, 10)
heapq.heappush(min_heap, 5)
heapq.heappush(min_heap, 3)
heapq.heappush(min_heap, 12)
print("Min Heap:", min_heap)

# Pop the smallest element
smallest = heapq.heappop(min_heap)
print("Smallest element:", smallest)
print("Min Heap after pop:", min_heap)
```

**Output:**

```
Min Heap: [3, 10, 5, 12]
Smallest element: 3
Min Heap after pop: [5, 10, 12]
```

Note that the printed list is the heap's internal array representation: it satisfies the heap property (the smallest element is at index 0), but it is not fully sorted.

- **Max Heap:** To create a max heap, we can store negative values.

```python
import heapq

# Create an empty heap
max_heap = []

# Adding elements to the heap by pushing negative values
heapq.heappush(max_heap, -10)
heapq.heappush(max_heap, -5)
heapq.heappush(max_heap, -3)
heapq.heappush(max_heap, -12)

# Convert back to positive values for display
print("Max Heap:", [-x for x in max_heap])

# Pop the largest element
largest = -heapq.heappop(max_heap)
print("Largest element:", largest)
print("Max Heap after pop:", [-x for x in max_heap])

```

**Output:**

```
Max Heap: [12, 10, 3, 5]
Largest element: 12
Max Heap after pop: [10, 5, 3]
```

## Heap Operations:
1. **Push Operation:** Adds an element to the heap, maintaining the heap property.
```python
heapq.heappush(heap, item)
```
2. **Pop Operation:** Removes and returns the smallest element from the heap.
```python
smallest = heapq.heappop(heap)
```
3. **Heapify Operation:** Converts a list into a heap in-place.
```python
heapq.heapify(list)
```
4. **Peek Operation:** Gets the smallest element without popping it. heapq has no dedicated function for this, but the smallest element is always at index 0.
```python
smallest = heap[0]
```

## Example:
```python
# importing "heapq" to implement heap queue
import heapq

# initializing list
li = [15, 77, 90, 1, 3]

# using heapify to convert list into heap
heapq.heapify(li)

# printing created heap
print("The created heap is : ", end="")
print(list(li))

# using heappush() to push elements into heap
# pushes 4
heapq.heappush(li, 4)

# printing modified heap
print("The modified heap after push is : ", end="")
print(list(li))

# using heappop() to pop smallest element
print("The popped and smallest element is : ", end="")
print(heapq.heappop(li))

```

Output:
```
The created heap is : [1, 3, 90, 77, 15]
The modified heap after push is : [1, 3, 4, 77, 15, 90]
The popped and smallest element is : 1
```

Again, the printed lists show the heap's array representation, not a sorted order; only the element at index 0 is guaranteed to be the minimum.

## Advantages and Disadvantages of Heaps:

### Advantages:

**Efficient:** Heap queues, implemented in Python's heapq module, offer remarkable efficiency in managing priority queues and heaps. With logarithmic time complexity for key operations, they are widely favored in various applications for their performance.

**Space-efficient:** Leveraging an array-based representation, heap queues optimize memory usage compared to node-based structures like linked lists. This design minimizes overhead, enhancing efficiency in memory management.

**Ease of Use:** Python's heap queues boast a user-friendly API, simplifying fundamental operations such as insertion, deletion, and retrieval. This simplicity contributes to rapid development and code maintenance.

**Flexibility:** Beyond their primary use in priority queues and heaps, Python's heap queues lend themselves to diverse applications. They can be adapted to implement various data structures, including binary trees, showcasing their versatility and broad utility across different domains.

### Disadvantages:

**Limited functionality:** Heap queues are primarily designed for managing priority queues and heaps, and may not be suitable for more complex data structures and algorithms.

**No random access:** Heap queues do not support random access to elements, making it difficult to access elements in the middle of the heap or modify elements that are not at the top of the heap.

**No sorting:** Heap queues do not keep their elements in sorted order, so if you need elements in a specific order, you will need to use a different data structure or algorithm.

**Not thread-safe:** Heap queues are not thread-safe, meaning that they may not be suitable for use in multi-threaded applications where data synchronization is critical.

## Real-Life Examples of Heaps:

1. **Priority Queues:**
Heaps are commonly used to implement priority queues, which are used in various algorithms like Dijkstra's shortest path algorithm and Prim's minimum spanning tree algorithm.

2. **Scheduling Algorithms:**
Heaps are used in job scheduling algorithms where tasks with the highest priority need to be processed first.

3. **Merge K Sorted Lists:**
Heaps can be used to efficiently merge multiple sorted lists into a single sorted list.

4. **Real-Time Event Simulation:**
Heaps are used in event-driven simulators to manage events scheduled to occur at future times.

5. **Median Finding Algorithm:**
Heaps can be used to maintain a dynamic set of numbers to find the median efficiently (a sketch follows below).
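As a concrete illustration of the last use case, here is a minimal sketch of the classic two-heap technique for maintaining a running median. The `RunningMedian` class and the sample values are illustrative, but the heapq calls are standard:

```python
import heapq

class RunningMedian:
    """Maintain the median of a stream using two heaps."""

    def __init__(self):
        self.lower = []  # max heap (stored as negated values) for the smaller half
        self.upper = []  # min heap for the larger half

    def add(self, num):
        # Push onto the max heap, then move its largest element to the
        # min heap so every element in `lower` <= every element in `upper`.
        heapq.heappush(self.lower, -num)
        heapq.heappush(self.upper, -heapq.heappop(self.lower))
        # Rebalance so `lower` never has fewer elements than `upper`.
        if len(self.upper) > len(self.lower):
            heapq.heappush(self.lower, -heapq.heappop(self.upper))

    def median(self):
        if len(self.lower) > len(self.upper):
            return -self.lower[0]
        return (-self.lower[0] + self.upper[0]) / 2

rm = RunningMedian()
for x in [5, 15, 1, 3]:
    rm.add(x)
    print(rm.median())  # prints 5, 10.0, 5, 4.0
```

Each `add` costs O(log n) while `median` is O(1), which is exactly the trade-off that makes heaps attractive here.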
diff --git a/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo1.png b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo1.png new file mode 100644 index 00000000..b937f046 Binary files /dev/null and b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo1.png differ diff --git a/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo2.png b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo2.png new file mode 100644 index 00000000..e1cacef1 Binary files /dev/null and b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo2.png differ diff --git a/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo3.png b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo3.png new file mode 100644 index 00000000..a5b69f9d Binary files /dev/null and b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo3.png differ diff --git a/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo4.png b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo4.png new file mode 100644 index 00000000..54d1889c Binary files /dev/null and b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo4.png differ diff --git a/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo5.png b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo5.png new file mode 100644 index 00000000..a3a6d508 Binary files /dev/null and b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo5.png differ diff --git a/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo6.png b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo6.png new file mode 100644 index 00000000..db7d948c Binary files /dev/null and b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo6.png differ diff --git a/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo7.png b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo7.png new file mode 100644 index 00000000..0b4eaf89 Binary files /dev/null and b/contrib/ds-algorithms/images/Dijkstra's_algorithm_photo7.png differ diff --git a/contrib/ds-algorithms/images/binarytree.png b/contrib/ds-algorithms/images/binarytree.png new file mode 100644 index 00000000..4137cdfc Binary files /dev/null and b/contrib/ds-algorithms/images/binarytree.png differ diff --git a/contrib/ds-algorithms/images/inorder-traversal.png b/contrib/ds-algorithms/images/inorder-traversal.png new file mode 100644 index 00000000..61b32c70 Binary files /dev/null and b/contrib/ds-algorithms/images/inorder-traversal.png differ diff --git a/contrib/ds-algorithms/images/postorder-traversal.png b/contrib/ds-algorithms/images/postorder-traversal.png new file mode 100644 index 00000000..69ac6590 Binary files /dev/null and b/contrib/ds-algorithms/images/postorder-traversal.png differ diff --git a/contrib/ds-algorithms/images/preorder-traversal.png b/contrib/ds-algorithms/images/preorder-traversal.png new file mode 100644 index 00000000..e85a70d8 Binary files /dev/null and b/contrib/ds-algorithms/images/preorder-traversal.png differ diff --git a/contrib/ds-algorithms/images/traversal.png b/contrib/ds-algorithms/images/traversal.png new file mode 100644 index 00000000..556f1775 Binary files /dev/null and b/contrib/ds-algorithms/images/traversal.png differ diff --git a/contrib/ds-algorithms/index.md b/contrib/ds-algorithms/index.md index 9c05b15e..0c29ec75 100644 --- a/contrib/ds-algorithms/index.md +++ b/contrib/ds-algorithms/index.md @@ -15,4 +15,12 @@ - [Trie](trie.md) - [Two Pointer Technique](two-pointer-technique.md) - [Hashing through Linear Probing](hashing-linear-probing.md) -- [Hashing through 
Chaining](hashing-chaining.md)
\ No newline at end of file
+- [Hashing through Chaining](hashing-chaining.md)
+- [Heaps](heaps.md)
+- [Hash Tables, Sets, Maps](hash-tables.md)
+- [Binary Tree](binary-tree.md)
+- [AVL Trees](avl-trees.md)
+- [Splay Trees](splay-trees.md)
+- [Dijkstra's Algorithm](dijkstra.md)
+- [Deque](deque.md)
+- [Tree Traversals](tree-traversal.md)
diff --git a/contrib/ds-algorithms/linked-list.md b/contrib/ds-algorithms/linked-list.md
index ddbc6d56..59631e7c 100644
--- a/contrib/ds-algorithms/linked-list.md
+++ b/contrib/ds-algorithms/linked-list.md
@@ -1,6 +1,6 @@
 # Linked List Data Structure
 
-Link list is a linear data Structure which can be defined as collection of objects called nodes that are randomly stored in the memory.
+A linked list is a linear data structure that can be defined as a collection of objects, called nodes, that are stored at arbitrary locations in memory.
 A node contains two types of metadata i.e. data stored at that particular address and the pointer which contains the address of the next node in the memory.
 
 The last element in a linked list features a null pointer.
@@ -36,10 +36,10 @@ The smallest Unit: Node
 Now, we will see the types of linked list.
 
 There are mainly four types of linked list,
-1. Singly Link list
-2. Doubly link list
-3. Circular link list
-4. Doubly circular link list
+1. Singly linked list
+2. Doubly linked list
+3. Circular linked list
+4. Doubly circular linked list
 
 ## 1. Singly linked list.
@@ -160,6 +160,18 @@ check the list is empty otherwise shift the head to next node.
             temp.next = None # Remove the last node by setting the next pointer of the second-last node to None
 ```
 
+### Reversing the linked list
+```python
+    def reverseList(self):
+        prev = None
+        temp = self.head
+        while(temp):
+            nextNode = temp.next  # Store the next node
+            temp.next = prev      # Reverse the pointer of the current node
+            prev = temp           # Move prev one step forward
+            temp = nextNode       # Move temp one step forward
+        self.head = prev          # Update the head pointer to the last node
+```
 
 ### Search in a linked list
 ```python
@@ -174,6 +186,8 @@ check the list is empty otherwise shift the head to next node.
         return f"Value '{value}' not found in the list"
 ```
 
+Now, connect all the code:
+
 ```python
 if __name__ == '__main__':
     llist = LinkedList()
@@ -197,13 +211,17 @@ check the list is empty otherwise shift the head to next node.
     #delete at the end
     llist.deleteFromEnd() # 2 3 56 9 4 10
 
-    # Print the list
+    # Print the original list
+    llist.printList()
+    llist.reverseList() #10 4 9 56 3 2
+    # Print the reversed list
     llist.printList()
 ```
 
 ## Output:
 
-2 3 56 9 4 10
+2 3 56 9 4 10 
+10 4 9 56 3 2
 
 ## Real Life uses of Linked List
diff --git a/contrib/ds-algorithms/sorting-algorithms.md b/contrib/ds-algorithms/sorting-algorithms.md
index 55367b82..a3cd72e2 100644
--- a/contrib/ds-algorithms/sorting-algorithms.md
+++ b/contrib/ds-algorithms/sorting-algorithms.md
@@ -561,3 +561,58 @@ print("Sorted string:", sorted_str)
 ### Complexity Analysis
 - **Time Complexity:** O(n+k) for all cases. No matter how the elements are placed in the array, the algorithm goes through n+k steps
 - **Space Complexity:** O(max). The larger the range of elements, the larger the space complexity.
+
+
+## 9. Cyclic Sort
+
+### Theory
+Cyclic Sort is an in-place sorting algorithm that is useful for sorting arrays where the elements are in a known range (e.g., 1 to N). The key idea behind the algorithm is that each number should be placed at its correct index.
If we find a number that is not at its correct index, we swap it with the number at its correct index. This process is repeated until every number is at its correct index.

### Algorithm
- Iterate over the array from the start to the end.
- For each element, check if it is at its correct index.
- If it is not at its correct index, swap it with the element at its correct index.
- Continue this process until the element at the current index is in its correct position. Move to the next index and repeat the process until the end of the array is reached.

### Steps
- Start with the first element.
- Check if it is at the correct index (i.e., if `arr[i] == i + 1`).
- If it is not, swap it with the element at the index `arr[i] - 1`.
- Repeat step 2 for the current element until it is at the correct index.
- Move to the next element and repeat the process.

### Code

```python
def cyclic_sort(nums):
    i = 0
    while i < len(nums):
        correct_index = nums[i] - 1
        if nums[i] != nums[correct_index]:
            nums[i], nums[correct_index] = nums[correct_index], nums[i]  # Swap
        else:
            i += 1
    return nums
```

### Example
```python
arr = [3, 1, 5, 4, 2]
sorted_arr = cyclic_sort(arr)
print(sorted_arr)
```
### Output
```
[1, 2, 3, 4, 5]
```

### Complexity Analysis
**Time Complexity:**

The time complexity of Cyclic Sort is **O(n)**.
This is because in each cycle, each element is either placed in its correct position or a swap is made. Since each element is swapped at most once, the total number of swaps (and hence the total number of operations) is linear in the number of elements.

**Space Complexity:**

The space complexity of Cyclic Sort is **O(1)**.
This is because the algorithm only requires a constant amount of additional space beyond the input array.
\ No newline at end of file
diff --git a/contrib/ds-algorithms/splay-trees.md b/contrib/ds-algorithms/splay-trees.md
new file mode 100644
index 00000000..ee900ed8
--- /dev/null
+++ b/contrib/ds-algorithms/splay-trees.md
@@ -0,0 +1,162 @@
# Splay Tree

In Data Structures and Algorithms, a **Splay Tree** is a self-adjusting binary search tree with the additional property that recently accessed elements are quick to access again. It performs basic operations such as insertion, search, and deletion in O(log n) amortized time. This is achieved by a process called **splaying**, where the accessed node is moved to the root through a series of tree rotations.

## Points to be Remembered

- **Splaying**: Moving the accessed node to the root using rotations.
- **Rotations**: Tree rotations (left and right) are used to balance the tree during splaying.
- **Self-adjusting**: The tree adjusts itself with each access, keeping frequently accessed nodes near the root.

## Real Life Examples of Splay Trees

- **Cache Implementation**: Frequently accessed data is kept near the top of the tree, making repeated accesses faster.
- **Networking**: Routing tables in network switches can use splay trees to prioritize frequently accessed routes.

## Applications of Splay Trees

Splay trees are used in various applications in computer science:

- **Cache Implementations**
- **Garbage Collection Algorithms**
- **Data Compression Algorithms (e.g., LZ78)**

Understanding these applications is essential for software development.

## Operations in Splay Tree

Key operations include:

- **INSERT**: Insert a new element into the splay tree.
- **SEARCH**: Find the position of an element in the splay tree.
- **DELETE**: Remove an element from the splay tree.

## Implementing Splay Tree in Python

```python
class SplayTreeNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

class SplayTree:
    def __init__(self):
        self.root = None

    def insert(self, key):
        self.root = self.splay_insert(self.root, key)

    def search(self, key):
        self.root = self.splay_search(self.root, key)
        return self.root

    def splay(self, root, key):
        # Brings the node with the given key (or the last node on the
        # search path) to the root using zig-zig / zig-zag rotations.
        if not root or root.key == key:
            return root

        if root.key > key:  # key lies in the left subtree
            if not root.left:
                return root
            if root.left.key > key:  # zig-zig (left-left)
                root.left.left = self.splay(root.left.left, key)
                root = self.rotateRight(root)
            elif root.left.key < key:  # zig-zag (left-right)
                root.left.right = self.splay(root.left.right, key)
                if root.left.right:
                    root.left = self.rotateLeft(root.left)
            return root if not root.left else self.rotateRight(root)

        else:  # key lies in the right subtree
            if not root.right:
                return root
            if root.right.key > key:  # zig-zag (right-left)
                root.right.left = self.splay(root.right.left, key)
                if root.right.left:
                    root.right = self.rotateRight(root.right)
            elif root.right.key < key:  # zig-zig (right-right)
                root.right.right = self.splay(root.right.right, key)
                root = self.rotateLeft(root)
            return root if not root.right else self.rotateLeft(root)

    def splay_insert(self, root, key):
        if not root:
            return SplayTreeNode(key)

        root = self.splay(root, key)

        if root.key == key:
            return root

        new_node = SplayTreeNode(key)

        # The splayed root becomes a child of the new root
        if root.key > key:
            new_node.right = root
            new_node.left = root.left
            root.left = None
        else:
            new_node.left = root
            new_node.right = root.right
            root.right = None

        return new_node

    def splay_search(self, root, key):
        return self.splay(root, key)

    def rotateRight(self, node):
        temp = node.left
        node.left = temp.right
        temp.right = node
        return temp

    def rotateLeft(self, node):
        temp = node.right
        node.right = temp.left
        temp.left = node
        return temp

    def preOrder(self, root):
        if root:
            print(root.key, end=' ')
            self.preOrder(root.left)
            self.preOrder(root.right)

# Example usage:
splay_tree = SplayTree()
splay_tree.insert(50)
splay_tree.insert(30)
splay_tree.insert(20)
splay_tree.insert(40)
splay_tree.insert(70)
splay_tree.insert(60)
splay_tree.insert(80)

print("Preorder traversal of the Splay tree is:")
splay_tree.preOrder(splay_tree.root)

splay_tree.search(60)

print("\nSplay tree after search operation for key 60:")
splay_tree.preOrder(splay_tree.root)
```

## Output

```
Preorder traversal of the Splay tree is:
80 70 60 50 40 30 20 

Splay tree after search operation for key 60:
60 50 40 30 20 70 80 
```

Note how each insertion splays the tree, so the most recently inserted key (80) ends up at the root, and searching for 60 brings that node to the root.

## Complexity Analysis

The worst-case time complexities of the main operations in a Splay Tree are as follows:

- **Insertion**: (O(n)). In the worst case, insertion may take linear time if the tree is highly unbalanced.
- **Search**: (O(n)). In the worst case, searching for a node may take linear time if the tree is highly unbalanced.
- **Deletion**: (O(n)). In the worst case, deleting a node may take linear time if the tree is highly unbalanced.

While these operations can take linear time in the worst case, the splay operation ensures that the tree remains balanced over a sequence of operations, leading to better average-case performance.
\ No newline at end of file
diff --git a/contrib/ds-algorithms/tree-traversal.md b/contrib/ds-algorithms/tree-traversal.md
new file mode 100644
index 00000000..4ec72ee8
--- /dev/null
+++ b/contrib/ds-algorithms/tree-traversal.md
@@ -0,0 +1,195 @@
# Tree Traversal Algorithms

Tree traversal refers to the process of visiting or accessing each node of the tree exactly once in a certain order. Tree traversal algorithms help us visit and process all the nodes of the tree. Since a tree is not a linear data structure, there can be multiple nodes to visit after a given node, and different traversal techniques differ in the order in which the nodes are visited.


A tree data structure can be traversed in the following ways:
 - **Level Order Traversal or Breadth First Search or BFS**
 - **Depth First Search or DFS**
    - Inorder Traversal
    - Preorder Traversal
    - Postorder Traversal

 ![Tree Traversal](images/traversal.png)



## Binary Tree Structure
Before diving into traversal techniques, let's define a simple binary tree node structure:

![Binary Tree](images/binarytree.png)

```python
class Node:
    def __init__(self, key):
        self.left = None
        self.right = None
        self.data = key

# Driver code: build the sample tree shown above
if __name__ == "__main__":
    root = Node(1)
    root.left = Node(2)
    root.right = Node(3)
    root.left.left = Node(4)
    root.left.right = Node(5)
    root.right.left = Node(6)
    root.right.right = Node(7)
```

## Level Order Traversal
When the nodes of the tree are visited level by level from left to right, the result is a level order traversal. We can use a queue data structure to implement it.

### Algorithm
 - Create an empty queue Q
 - Enqueue the root node of the tree to Q
 - Loop while Q is not empty
   - Dequeue a node from Q and visit it
   - Enqueue the left child of the dequeued node if it exists
   - Enqueue the right child of the dequeued node if it exists

### Code for level order traversal in Python
```python
def printLevelOrder(root):
    if root is None:
        return

    # Create an empty queue
    queue = []

    # Enqueue the root
    queue.append(root)

    while(len(queue) > 0):

        # Print front of queue and
        # remove it from queue
        print(queue[0].data, end=" ")
        node = queue.pop(0)

        # Enqueue left child
        if node.left is not None:
            queue.append(node.left)

        # Enqueue right child
        if node.right is not None:
            queue.append(node.right)
```

**Output**

` Level order traversal of binary tree is :
1 2 3 4 5 6 7 `



## Depth First Search
In a depth-first traversal, we go as deep as possible along one branch first, then backtrack and explore the remaining branches. There are three kinds of depth-first traversals.

## 1. Inorder Traversal

In this traversal method, the left subtree is visited first, then the root and later the right subtree. We should always remember that every node may represent a subtree itself.

`Note :` If a binary search tree is traversed in-order, the output will produce the key values sorted in ascending order.

![Inorder](images/inorder-traversal.png)

**The order:** Left -> Root -> Right

### Algorithm
 - Traverse the left subtree.
 - Visit the root node.
 - Traverse the right subtree.

### Code for inorder traversal in Python
```python
def printInorder(root):
    if root:
        # First recur on the left child
        printInorder(root.left)

        # Then print the data of the node
        print(root.data, end=" ")

        # Now recur on the right child
        printInorder(root.right)
```

**Output**

` Inorder traversal of binary tree is :
4 2 5 1 6 3 7 `


## 2. Preorder Traversal

In this traversal method, the root node is visited first, then the left subtree and finally the right subtree.

![preorder](images/preorder-traversal.png)

**The order:** Root -> Left -> Right

### Algorithm
 - Visit the root node.
 - Traverse the left subtree.
 - Traverse the right subtree.

### Code for preorder traversal in Python
```python
def printPreorder(root):
    if root:
        # First print the data of the node
        print(root.data, end=" ")

        # Then recur on the left child
        printPreorder(root.left)

        # Finally recur on the right child
        printPreorder(root.right)
```

**Output**

` Preorder traversal of binary tree is :
1 2 4 5 3 6 7 `

## 3. Postorder Traversal

In this traversal method, the root node is visited last, hence the name. First we traverse the left subtree, then the right subtree and finally the root node.

![postorder](images/postorder-traversal.png)

**The order:** Left -> Right -> Root

### Algorithm
 - Traverse the left subtree.
 - Traverse the right subtree.
 - Visit the root node.

### Code for postorder traversal in Python
```python
def printPostorder(root):
    if root:
        # First recur on the left child
        printPostorder(root.left)

        # Then recur on the right child
        printPostorder(root.right)

        # Now print the data of the node
        print(root.data, end=" ")
```

**Output**

` Postorder traversal of binary tree is :
4 5 2 6 7 3 1 `


## Complexity Analysis
 - **Time Complexity:** All three tree traversal methods (Inorder, Preorder, and Postorder) have a time complexity of `𝑂(𝑛)`, where 𝑛 is the number of nodes in the tree.
 - **Space Complexity:** The space complexity is influenced by the recursion stack. In the worst case, the depth of the recursion stack can go up to `𝑂(ℎ)`, where ℎ is the height of the tree.
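To see all four traversals together on the sample tree built in the *Binary Tree Structure* section, a short driver like the following can be used (it assumes `root` and the four functions defined above are in scope):

```python
print("Level order traversal of binary tree is :")
printLevelOrder(root)   # 1 2 3 4 5 6 7

print("\nInorder traversal of binary tree is :")
printInorder(root)      # 4 2 5 1 6 3 7

print("\nPreorder traversal of binary tree is :")
printPreorder(root)     # 1 2 4 5 3 6 7

print("\nPostorder traversal of binary tree is :")
printPostorder(root)    # 4 5 2 6 7 3 1
```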
+ + + + diff --git a/contrib/machine-learning/assets/XG_1.webp b/contrib/machine-learning/assets/XG_1.webp new file mode 100644 index 00000000..c693d3da Binary files /dev/null and b/contrib/machine-learning/assets/XG_1.webp differ diff --git a/contrib/machine-learning/assets/eda/bi-variate-analysis.png b/contrib/machine-learning/assets/eda/bi-variate-analysis.png new file mode 100644 index 00000000..076cc505 Binary files /dev/null and b/contrib/machine-learning/assets/eda/bi-variate-analysis.png differ diff --git a/contrib/machine-learning/assets/eda/correlation-analysis.png b/contrib/machine-learning/assets/eda/correlation-analysis.png new file mode 100644 index 00000000..e6f3ee60 Binary files /dev/null and b/contrib/machine-learning/assets/eda/correlation-analysis.png differ diff --git a/contrib/machine-learning/assets/eda/multi-variate-analysis.png b/contrib/machine-learning/assets/eda/multi-variate-analysis.png new file mode 100644 index 00000000..5dc042b9 Binary files /dev/null and b/contrib/machine-learning/assets/eda/multi-variate-analysis.png differ diff --git a/contrib/machine-learning/assets/eda/uni-variate-analysis1.png b/contrib/machine-learning/assets/eda/uni-variate-analysis1.png new file mode 100644 index 00000000..b4905dcf Binary files /dev/null and b/contrib/machine-learning/assets/eda/uni-variate-analysis1.png differ diff --git a/contrib/machine-learning/assets/eda/uni-variate-analysis2.png b/contrib/machine-learning/assets/eda/uni-variate-analysis2.png new file mode 100644 index 00000000..cf56c70a Binary files /dev/null and b/contrib/machine-learning/assets/eda/uni-variate-analysis2.png differ diff --git a/contrib/machine-learning/assets/km_.png b/contrib/machine-learning/assets/km_.png new file mode 100644 index 00000000..3f674126 Binary files /dev/null and b/contrib/machine-learning/assets/km_.png differ diff --git a/contrib/machine-learning/assets/km_2.png b/contrib/machine-learning/assets/km_2.png new file mode 100644 index 00000000..cf786cf2 Binary files /dev/null and b/contrib/machine-learning/assets/km_2.png differ diff --git a/contrib/machine-learning/assets/km_3.png b/contrib/machine-learning/assets/km_3.png new file mode 100644 index 00000000..ecd34ff5 Binary files /dev/null and b/contrib/machine-learning/assets/km_3.png differ diff --git a/contrib/machine-learning/assets/knm.png b/contrib/machine-learning/assets/knm.png new file mode 100644 index 00000000..4b7a2190 Binary files /dev/null and b/contrib/machine-learning/assets/knm.png differ diff --git a/contrib/machine-learning/eda.md b/contrib/machine-learning/eda.md new file mode 100644 index 00000000..1559a099 --- /dev/null +++ b/contrib/machine-learning/eda.md @@ -0,0 +1,184 @@ +# Exploratory Data Analysis + +Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. EDA is used to understand the data, get a sense of the data, and to identify relationships between variables. EDA is a crucial step in the data analysis process and should be done before building a model. + +## Why is EDA important? + +1. **Understand the data**: EDA helps to understand the data, its structure, and its characteristics. + +2. **Identify patterns and relationships**: EDA helps to identify patterns and relationships between variables. + +3. **Detect outliers and anomalies**: EDA helps to detect outliers and anomalies in the data. + +4. 
**Prepare data for modeling**: EDA helps to prepare the data for modeling by identifying and handling missing values and transforming variables.

## Steps in EDA

1. **Data Collection**: Collect the data from various sources.

2. **Data Cleaning**: Clean the data by handling missing values, removing duplicates, and transforming variables.

3. **Data Exploration**: Explore the data by visualizing it, summarizing it, and identifying patterns and relationships.

4. **Data Analysis**: Analyze the data by performing statistical analysis, hypothesis testing, and building models.

5. **Data Visualization**: Visualize the data using various plots and charts to understand it better.

## Tools for EDA

1. **Python**: Python is a popular programming language for data analysis and has many libraries for EDA, such as Pandas, NumPy, Matplotlib, Seaborn, and Plotly.

2. **Jupyter Notebook**: Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.

## Techniques for EDA

1. **Descriptive Statistics**: Descriptive statistics summarize the main characteristics of a data set, such as mean, median, mode, standard deviation, and variance.

2. **Data Visualization**: Data visualization is the graphical representation of data to understand it better, using plots such as histograms, scatter plots, box plots, and heat maps.

3. **Correlation Analysis**: Correlation analysis is used to measure the strength and direction of the relationship between two variables.

4. **Hypothesis Testing**: Hypothesis testing is used to test a hypothesis about a population parameter based on sample data.

5. **Dimensionality Reduction**: Dimensionality reduction is the process of reducing the number of variables in a data set while retaining as much information as possible.

6. **Clustering Analysis**: Clustering analysis is used to group similar data points together based on their characteristics.

## Commonly Used Techniques in EDA

1. **Uni-variate Analysis**: Uni-variate analysis is the simplest form of data analysis that involves analyzing a single variable at a time.

2. **Bi-variate Analysis**: Bi-variate analysis involves analyzing two variables at a time to understand the relationship between them.

3. **Multi-variate Analysis**: Multi-variate analysis involves analyzing more than two variables at a time to understand the relationships between them.

## Understand with an Example

Let's understand EDA with an example, using the famous Iris dataset.

The dataset consists of 150 samples of iris flowers, where each sample represents measurements of four features (variables) for three species of iris flowers.

The four features measured are:
Sepal length (in cm), Sepal width (in cm), Petal length (in cm), and Petal width (in cm).

The three species of iris flowers included in the dataset are:
**Setosa**, **Versicolor**, **Virginica**

```python
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
df['species'] = iris.target_names[iris.target]  # add the species labels used below
df.head()
```

| sepal_length | sepal_width | petal_length | petal_width | species |
|--------------|-------------|--------------|-------------|---------|
| 5.1          | 3.5         | 1.4          | 0.2         | setosa  |
| 4.9          | 3.0         | 1.4          | 0.2         | setosa  |
| 4.7          | 3.2         | 1.3          | 0.2         | setosa  |
| 4.6          | 3.1         | 1.5          | 0.2         | setosa  |
| 5.0          | 3.6         | 1.4          | 0.2         | setosa  |


### Uni-variate Analysis

```python
# Uni-variate Analysis
df_setosa = df.loc[df['species'] == 'setosa']
df_virginica = df.loc[df['species'] == 'virginica']
df_versicolor = df.loc[df['species'] == 'versicolor']

plt.plot(df_setosa['sepal_length'])
plt.plot(df_virginica['sepal_length'])
plt.plot(df_versicolor['sepal_length'])
plt.xlabel('sepal length')
plt.show()
```
![Uni-variate Analysis](assets/eda/uni-variate-analysis1.png)

```python
plt.hist(df_setosa['petal_length'])
plt.hist(df_virginica['petal_length'])
plt.hist(df_versicolor['petal_length'])
plt.xlabel('petal length')
plt.show()
```
![Uni-variate Analysis](assets/eda/uni-variate-analysis2.png)

### Bi-variate Analysis

```python
# Bi-variate Analysis
sns.FacetGrid(df, hue="species", height=5).map(plt.scatter, "petal_length", "sepal_width").add_legend()
plt.show()
```
![Bi-variate Analysis](assets/eda/bi-variate-analysis.png)

### Multi-variate Analysis

```python
# Multi-variate Analysis
sns.pairplot(df, hue="species", height=3)
```
![Multi-variate Analysis](assets/eda/multi-variate-analysis.png)

### Correlation Analysis

```python
# Correlation Analysis (numeric columns only)
corr_matrix = df.drop(columns='species').corr()
sns.heatmap(corr_matrix)
```
| | sepal_length | sepal_width | petal_length | petal_width |
|-------------|--------------|-------------|--------------|-------------|
| sepal_length| 1.000000 | -0.109369 | 0.871754 | 0.817954 |
| sepal_width | -0.109369 | 1.000000 | -0.420516 | -0.356544 |
| petal_length| 0.871754 | -0.420516 | 1.000000 | 0.962757 |
| petal_width | 0.817954 | -0.356544 | 0.962757 | 1.000000 |

![Correlation Analysis](assets/eda/correlation-analysis.png)

## Exploratory Data Analysis (EDA) Report on Iris Dataset

### Introduction
The Iris dataset consists of 150 samples of iris flowers, each characterized by four features: Sepal Length, Sepal Width, Petal Length, and Petal Width. These samples belong to three species of iris flowers: Setosa, Versicolor, and Virginica. In this EDA report, we explore the dataset to gain insights into the characteristics and relationships among the features and species.

### Uni-variate Analysis
Uni-variate analysis examines each variable individually.
- Sepal Length: The distribution of Sepal Length varies among the different species, with Setosa generally having shorter sepals compared to Versicolor and Virginica.
- Petal Length: Setosa tends to have shorter petal lengths, while Versicolor and Virginica have relatively longer petal lengths.

### Bi-variate Analysis
Bi-variate analysis explores the relationship between two variables.
- Petal Length vs. Sepal Width: There is a noticeable separation between species, especially Setosa, which typically has shorter and wider sepals compared to Versicolor and Virginica.
+- This analysis suggests potential patterns distinguishing the species based on these two features. + +### Multi-variate Analysis +Multi-variate analysis considers interactions among multiple variables simultaneously. +- Pairplot: The pairplot reveals distinctive clusters for each species, particularly in the combinations of Petal Length and Petal Width, indicating clear separation among species based on these features. + +### Correlation Analysis +Correlation analysis examines the relationship between variables. +- Correlation Heatmap: There are strong positive correlations between Petal Length and Petal Width, as well as between Petal Length and Sepal Length. Sepal Width shows a weaker negative correlation with Petal Length and Petal Width. + +### Insights +1. Petal dimensions (length and width) exhibit strong correlations, suggesting that they may collectively contribute more significantly to distinguishing between iris species. +2. Setosa tends to have shorter and wider sepals compared to Versicolor and Virginica. +3. The combination of Petal Length and Petal Width appears to be a more effective discriminator among iris species, as indicated by the distinct clusters observed in multi-variate analysis. + +### Conclusion +Through comprehensive exploratory data analysis, we have gained valuable insights into the Iris dataset, highlighting key characteristics and relationships among features and species. Further analysis and modeling could leverage these insights to develop robust classification models for predicting iris species based on their measurements. + +## Conclusion + +Exploratory Data Analysis (EDA) is a critical step in the data analysis process that helps to understand the data, identify patterns and relationships, detect outliers, and prepare the data for modeling. By using various techniques and tools, such as descriptive statistics, data visualization, correlation analysis, and hypothesis testing, EDA provides valuable insights into the data, enabling data scientists to make informed decisions and build accurate models. 



diff --git a/contrib/machine-learning/index.md b/contrib/machine-learning/index.md
index d3ae1099..7ee61ebe 100644
--- a/contrib/machine-learning/index.md
+++ b/contrib/machine-learning/index.md
@@ -2,15 +2,13 @@
 - [Introduction to scikit-learn](sklearn-introduction.md)
 - [Binomial Distribution](binomial-distribution.md)
+- [Naive Bayes](naive-bayes.md)
 - [Regression in Machine Learning](regression.md)
+- [Polynomial Regression](polynomial-regression.md)
 - [Confusion Matrix](confusion-matrix.md)
 - [Decision Tree Learning](decision-tree.md)
 - [Random Forest](random-forest.md)
 - [Support Vector Machine Algorithm](support-vector-machine.md)
-- [Artificial Neural Network from the Ground Up](ann.md)
-- [Introduction To Convolutional Neural Networks (CNNs)](intro-to-cnn.md)
-- [TensorFlow.md](tensorflow.md)
-- [PyTorch.md](pytorch.md)
 - [Ensemble Learning](ensemble-learning.md)
 - [Types of optimizers](types-of-optimizers.md)
 - [Logistic Regression](logistic-regression.md)
@@ -18,6 +16,15 @@
 - [Clustering](clustering.md)
 - [Hierarchical Clustering](hierarchical-clustering.md)
 - [Grid Search](grid-search.md)
-- [Transformers](transformers.md)
+- [K-Means](kmeans.md)
 - [K-nearest neighbor (KNN)](knn.md)
+- [Xgboost](xgboost.md)
+- [Artificial Neural Network from the Ground Up](ann.md)
+- [Introduction To Convolutional Neural Networks (CNNs)](intro-to-cnn.md)
+- [TensorFlow](tensorflow.md)
+- [PyTorch](pytorch.md)
+- [PyTorch Fundamentals](pytorch-fundamentals.md)
+- [Transformers](transformers.md)
+- [Reinforcement Learning](reinforcement-learning.md)
 - [Neural network regression](neural-network-regression.md)
+- [Exploratory Data Analysis](eda.md)
diff --git a/contrib/machine-learning/kmeans.md b/contrib/machine-learning/kmeans.md
new file mode 100644
index 00000000..52db92e5
--- /dev/null
+++ b/contrib/machine-learning/kmeans.md
@@ -0,0 +1,92 @@
# K-Means Clustering
Unsupervised Learning Algorithm for Grouping Similar Data.

## Introduction
K-means clustering is a fundamental unsupervised machine learning algorithm that excels at grouping similar data points together. It's a popular choice due to its simplicity and efficiency in uncovering hidden patterns within unlabeled datasets.

## Unsupervised Learning
Unlike supervised learning algorithms that rely on labeled data for training, unsupervised algorithms, like K-means, operate solely on input data (without predefined categories). Their objective is to discover inherent structures or groupings within the data.

## The K-Means Objective
Organize similar data points into clusters to unveil underlying patterns. The main objective is to minimize the total intra-cluster variance, i.e. the within-cluster sum of squared distances.

![image](assets/knm.png)
## Clusters and Centroids
A cluster represents a collection of data points that share similar characteristics. K-means identifies a pre-determined number (k) of clusters within the dataset. Each cluster is represented by a centroid, which acts as its central point (imaginary or real).

## Minimizing In-Cluster Variation
The K-means algorithm strategically assigns each data point to a cluster such that the total variation within each cluster (measured by the sum of squared distances between points and their centroid) is minimized. In simpler terms, K-means strives to create clusters where data points are close to their respective centroids.

## The Meaning Behind "K-Means"
The "means" in K-means refers to the averaging process used to compute the centroid, essentially finding the center of each cluster.
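Before walking through the algorithm steps, it helps to state this objective explicitly. Writing $C_i$ for the set of points assigned to cluster $i$ and $\mu_i$ for its centroid, K-means minimizes the within-cluster sum of squared distances:

$$
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
$$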

## K-Means Algorithm in Action
![image](assets/km_.png)
The K-means algorithm follows an iterative approach to optimize cluster formation:

1. **Initial Centroid Placement:** The process begins with randomly selecting k centroids to serve as initial reference points for each cluster.
2. **Data Point Assignment:** Each data point is assigned to the closest centroid, effectively creating a preliminary clustering.
3. **Centroid Repositioning:** Once data points are assigned, the centroids are recalculated by averaging the positions of the points within their respective clusters. These new centroids represent the refined centers of the clusters.
4. **Iteration Until Convergence:** Steps 2 and 3 are repeated iteratively until a stopping criterion is met. This criterion can be either:
   - **Centroid Stability:** No significant change occurs in the centroids' positions, indicating successful clustering.
   - **Reaching Maximum Iterations:** A predefined number of iterations is completed.

## Code
The following is a simple example of K-Means using scikit-learn.

```python
# Generate and Visualize Sample Data
# import the necessary libraries

import numpy as np
import matplotlib.pyplot as plt

# Create data points for cluster 1 and cluster 2
X = -2 * np.random.rand(100, 2)
X1 = 1 + 2 * np.random.rand(50, 2)

# Combine data points from both clusters
X[50:100, :] = X1

# Plot data points and display the plot
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.show()

# K-Means Model Creation and Training
from sklearn.cluster import KMeans

# Create KMeans object with 2 clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)  # Train the model on the data

# Visualize Data Points with Centroids
centroids = kmeans.cluster_centers_  # Get centroids (cluster centers)

plt.scatter(X[:, 0], X[:, 1], s=50, c='b')  # Plot data points again
plt.scatter(centroids[0, 0], centroids[0, 1], s=200, c='g', marker='s')  # Plot centroid 1
plt.scatter(centroids[1, 0], centroids[1, 1], s=200, c='r', marker='s')  # Plot centroid 2
plt.show()  # Display the plot with centroids

# Predict Cluster Label for New Data Point
new_data = np.array([-3.0, -3.0])
new_data_reshaped = new_data.reshape(1, -1)
predicted_cluster = kmeans.predict(new_data_reshaped)
print("Predicted cluster for new data:", predicted_cluster)
```

### Output:
Before Implementing K-Means Clustering
![Before Implementing K-Means Clustering](assets/km_2.png)

After Implementing K-Means Clustering
![After Implementing K-Means Clustering](assets/km_3.png)

Predicted cluster for new data: `[0]` (since the data and initialization are random, exact results may vary between runs)

## Conclusion
**K-Means** can be applied to data that has a smaller number of dimensions, is numeric, and is continuous, or it can be used to find groups that have not been explicitly labeled in the data. As an example, it can be used for document classification, delivery store optimization, or customer segmentation.

## References

- [Survey of Machine Learning and Data Mining Techniques used in Multimedia System](https://www.researchgate.net/publication/333457161_Survey_of_Machine_Learning_and_Data_Mining_Techniques_used_in_Multimedia_System?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6Il9kaXJlY3QiLCJwYWdlIjoiX2RpcmVjdCJ9fQ)
- [A Clustering Approach for Outliers Detection in a Big Point-of-Sales Database](https://www.researchgate.net/publication/339267868_A_Clustering_Approach_for_Outliers_Detection_in_a_Big_Point-of-Sales_Database?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6Il9kaXJlY3QiLCJwYWdlIjoiX2RpcmVjdCJ9fQ)
diff --git a/contrib/machine-learning/naive-bayes.md b/contrib/machine-learning/naive-bayes.md
new file mode 100644
index 00000000..4bf0f04c
--- /dev/null
+++ b/contrib/machine-learning/naive-bayes.md
@@ -0,0 +1,328 @@
# Naive Bayes

## Introduction

The Naive Bayes model uses probabilities to predict an outcome. It is a supervised machine learning technique, i.e. it requires labelled data for training. It is used for classification and is based on Bayes' Theorem. The basic assumption of this model is independence among the features, i.e. a feature is unaffected by any other feature.

## Bayes' Theorem

Bayes' theorem is given by:

$$
P(a|b) = \frac{P(b|a)*P(a)}{P(b)}
$$

where:
- $P(a|b)$ is the posterior probability, i.e. the probability of 'a' given that 'b' is true,
- $P(b|a)$ is the likelihood, i.e. the probability of 'b' given that 'a' is true,
- $P(a)$ and $P(b)$ are the prior probabilities of 'a' and 'b' respectively.


## Applications

The Naive Bayes classifier has numerous applications, including:
 1. Text classification.
 2. Sentiment analysis.
 3. Spam filtering.
 4. Multiclass classification (e.g. weather prediction).
 5. Recommendation systems.
 6. The healthcare sector.
 7. Document categorization.


## Advantages

 1. Easy to implement.
 2. Useful even if the training dataset is limited (where a decision tree would not be recommended).
 3. Supports multiclass classification, which is not supported by some machine learning algorithms like SVM and logistic regression.
 4. Scalable, fast and efficient.

## Disadvantages

 1. Assumes features to be independent, which may not be true in certain scenarios.
 2. Zero probability error.
 3. Sensitive to noise.

## Zero Probability Error

 A zero probability error occurs when the number of occurrences of an event given another event is zero, making one of the likelihoods in the product zero.
 To handle the zero probability error, Laplace's correction is used: a small constant (typically 1) is added to every count.

**Example:**


Given the data below, find whether tennis can be played if (outlook = overcast, wind = weak).

**Data**

---
| SNo | Outlook (A)  | Wind (B)   | PlayTennis (R)    |
|-----|--------------|------------|-------------------|
| 1   | Rain         | Weak       | No                |
| 2   | Rain         | Strong     | No                |
| 3   | Overcast     | Weak       | Yes               |
| 4   | Rain         | Weak       | Yes               |
| 5   | Overcast     | Weak       | Yes               |
| 6   | Rain         | Strong     | No                |
| 7   | Overcast     | Strong     | Yes               |
| 8   | Rain         | Weak       | No                |
| 9   | Overcast     | Weak       | Yes               |
| 10  | Rain         | Weak       | Yes               |
---

- **Calculate prior probabilities**

$$
 P(Yes) = \frac{6}{10} = 0.6
$$
$$
 P(No) = \frac{4}{10} = 0.4
$$

- **Calculate likelihoods**

  1. **Outlook (A):**

  ---
  | A\R       | Yes   | No  |
  |-----------|-------|-----|
  | Rain      | 2     | 4   |
  | Overcast  | 4     | 0   |
  | Total     | 6     | 4   |
  ---

- Rain:

$$P(Rain|Yes) = \frac{2}{6}$$

$$P(Rain|No) = \frac{4}{4}$$

- Overcast:

$$
 P(Overcast|Yes) = \frac{4}{6}
$$
$$
 P(Overcast|No) = \frac{0}{4}
$$


Here, we can see that P(Overcast|No) = 0.
This is a zero probability error!

Since one likelihood is 0, the Naive Bayes model fails to predict.

 **Applying Laplace's correction:**

 In this example, we scale the counts up to 1000 instances and add one instance to each zero-producing cell so that no likelihood is zero.
 - **Calculate prior probabilities**

 $$P(Yes) = \frac{600}{1002}$$

 $$P(No) = \frac{402}{1002}$$

- **Calculate likelihoods**

  1. **Outlook (A):**


  (converted to 1000 instances)

  We add 1 instance to each cell of the 'No' column (Laplace's correction):

  ---
  | A\R       | Yes   | No            |
  |-----------|-------|---------------|
  | Rain      | 200   | (400+1)=401   |
  | Overcast  | 400   | (0+1)=1       |
  | Total     | 600   | 402           |
  ---

  - **Rain:**

  $$P(Rain|Yes) = \frac{200}{600}$$
  $$P(Rain|No) = \frac{401}{402}$$

  - **Overcast:**

  $$P(Overcast|Yes) = \frac{400}{600}$$
  $$P(Overcast|No) = \frac{1}{402}$$


  2. **Wind (B):**


  ---
  | B\R       | Yes     | No    |
  |-----------|---------|-------|
  | Weak      | 500     | 200   |
  | Strong    | 100     | 200   |
  | Total     | 600     | 400   |
  ---

  - **Weak:**

  $$P(Weak|Yes) = \frac{500}{600}$$
  $$P(Weak|No) = \frac{200}{400}$$

  - **Strong:**

  $$P(Strong|Yes) = \frac{100}{600}$$
  $$P(Strong|No) = \frac{200}{400}$$

  - **Calculating the (unnormalized) posterior scores:**

  $$P(Yes|Overcast, Weak) \propto P(Yes) * P(Overcast|Yes) * P(Weak|Yes)$$
  $$= \frac{600}{1002} * \frac{400}{600} * \frac{500}{600}$$
  $$= 0.3326$$

  $$P(No|Overcast, Weak) \propto P(No) * P(Overcast|No) * P(Weak|No)$$
  $$= \frac{402}{1002} * \frac{1}{402} * \frac{200}{400}$$
  $$= 0.000499 = 0.0005$$


Since
$$P(Yes|Overcast, Weak) > P(No|Overcast, Weak)$$
we can conclude that tennis can be played if the outlook is overcast and the wind is weak.


# Types of Naive Bayes classifier


## Gaussian Naive Bayes

 It is used when the dataset has **continuous data**. It assumes that the data is normally distributed (also known as a Gaussian distribution).
 A Gaussian distribution can be characterized by a bell-shaped curve.

 **Continuous data features :** Features which can take any real value within a certain range. These features have an infinite number of possible values. They are generally measured, not counted,
 e.g. weight, height, temperature, etc.

 **Code**

```python
# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# read data
d = pd.read_csv("data.csv")
df = pd.DataFrame(d)

X = df.iloc[:, 1:7]  # feature columns
y = df.iloc[:, 7]    # target column

# splitting X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# training the model on the training set
obj = GaussianNB()
obj.fit(X_train, y_train)

# making predictions on the testing set
y_pred = obj.predict(X_test)

# comparing y_test and y_pred
print("Gaussian Naive Bayes model accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Confusion matrix: \n", confusion_matrix(y_test, y_pred))
```


## Multinomial Naive Bayes

 Appropriate when the features are categorical or countable. It models the likelihood of each feature as a multinomial distribution.
 The multinomial distribution gives the probability of counts across multiple categories (e.g. word counts in text classification).

 **Code**

```python
# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# read data
d = pd.read_csv("data.csv")
df = pd.DataFrame(d)

X = df.iloc[:, 1:7]  # feature columns
y = df.iloc[:, 7]    # target column

# splitting X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# training the model on the training set
obj = MultinomialNB()
obj.fit(X_train, y_train)

# making predictions on the testing set
y_pred = obj.predict(X_test)

# comparing y_test and y_pred
print("Multinomial Naive Bayes model accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Confusion matrix: \n", confusion_matrix(y_test, y_pred))
```

## Bernoulli Naive Bayes

 It is specifically designed for binary features (e.g. Yes or No). It models the likelihood of each feature as a Bernoulli distribution.
 The Bernoulli distribution is used when there are only two possible outcomes (e.g. success or failure of an event).

 **Code**

```python
# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics
from sklearn.metrics import confusion_matrix

# read data
d = pd.read_csv("data.csv")
df = pd.DataFrame(d)

X = df.iloc[:, 1:7]  # feature columns
y = df.iloc[:, 7]    # target column

# splitting X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# training the model on the training set
obj = BernoulliNB()
obj.fit(X_train, y_train)

# making predictions on the testing set
y_pred = obj.predict(X_test)

# comparing y_test and y_pred
print("Bernoulli Naive Bayes model accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Confusion matrix: \n", confusion_matrix(y_test, y_pred))
```


## Evaluation

 Common ways to evaluate a Naive Bayes classifier include:
 1. Confusion matrix.
 2. Accuracy.
 3. ROC curve.


## Conclusion

 Naive Bayes can be limited in some cases by its assumption that the features are independent of each other, but it is still reliable in many scenarios. It is an efficient classifier and works well even on small datasets.
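As a side note connecting back to the zero probability error section: scikit-learn's `MultinomialNB` applies additive (Laplace/Lidstone) smoothing through its `alpha` parameter, so zero counts never zero out a likelihood. A minimal sketch with made-up count data (the feature matrix below is purely illustrative):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features: note that feature 0 never occurs in class 1
X = np.array([[2, 1], [3, 0], [0, 4], [0, 3]])
y = np.array([0, 0, 1, 1])

# alpha=1.0 is Laplace smoothing; without it, P(feature 0 | class 1) would be 0
# and no sample containing feature 0 could ever be assigned to class 1
model = MultinomialNB(alpha=1.0)
model.fit(X, y)
print(model.predict([[1, 3]]))
```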
+
diff --git a/contrib/machine-learning/polynomial-regression.md b/contrib/machine-learning/polynomial-regression.md new file mode 100644 index 00000000..d00ede3b --- /dev/null +++ b/contrib/machine-learning/polynomial-regression.md @@ -0,0 +1,102 @@
# Polynomial Regression

Polynomial Regression is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modeled as an $n$th degree polynomial. This guide provides an overview of polynomial regression, including its fundamental concepts, assumptions, and how to implement it using Python.

## Introduction

Polynomial Regression is used when the data shows a non-linear relationship between the independent variable $x$ and the dependent variable $y$. It extends the simple linear regression model by considering polynomial terms of the independent variable, allowing for a more flexible fit to the data.

## Concepts

### Polynomial Equation

The polynomial regression model is based on the following polynomial equation:

$$
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots + \beta_n x^n + \epsilon
$$

Where:
- $y$ is the dependent variable.
- $x$ is the independent variable.
- $\beta_0, \beta_1, \ldots, \beta_n$ are the coefficients of the polynomial.
- $\epsilon$ is the error term.

### Degree of Polynomial

The degree of the polynomial $n$ determines the flexibility of the model. A higher degree allows the model to fit more complex, non-linear relationships, but it also increases the risk of overfitting.

### Overfitting and Underfitting

- **Overfitting**: When the model fits the noise in the training data too closely, resulting in poor generalization to new data.
- **Underfitting**: When the model is too simple to capture the underlying pattern in the data.

## Assumptions

1. **Independence**: Observations are independent of each other.
2. **Homoscedasticity**: The variance of the residuals (errors) is constant across all levels of the independent variable.
3. **Normality**: The residuals of the model are normally distributed.
4. **No Multicollinearity**: The predictor variables are not highly correlated with each other.

## Implementation

### Using Scikit-learn

Scikit-learn is a popular machine learning library in Python that provides tools for polynomial regression.
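
Before the full example below, here is a minimal sketch (using a tiny synthetic input, purely for illustration) of what `PolynomialFeatures` actually does: it expands each input value into a bias term and polynomial powers, and an ordinary linear model is then fit on these expanded features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2], [3]])            # two samples, one feature each
poly = PolynomialFeatures(degree=3)
print(poly.fit_transform(X))
# [[ 1.  2.  4.  8.]                -> [1, x, x^2, x^3] for x = 2
#  [ 1.  3.  9. 27.]]               -> [1, x, x^2, x^3] for x = 3
```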
+

### Code Example

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load dataset
data = pd.read_csv('path/to/your/dataset.csv')

# Define features and target variable
X = data[['feature']]
y = data['target']

# Transform features to polynomial features
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)

# Initialize and train polynomial regression model
model = LinearRegression()
model.fit(X_poly, y)

# Make predictions
y_pred = model.predict(X_poly)

# Evaluate the model
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

# Visualize the results (sort by feature value so the fitted curve is drawn smoothly)
order = np.argsort(X.iloc[:, 0].to_numpy())
plt.scatter(X, y, color='blue')
plt.plot(X.iloc[order], y_pred[order], color='red')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('Polynomial Regression')
plt.show()
```

## Evaluation Metrics

- **Mean Squared Error (MSE)**: The average of the squared differences between actual and predicted values.
- **R-squared (R²) Score**: A statistical measure that represents the proportion of the variance for the dependent variable that is explained by the independent variables in the model.

## Conclusion

Polynomial Regression is a powerful tool for modeling non-linear relationships between variables. It is important to choose the degree of the polynomial carefully to balance between underfitting and overfitting. Understanding and properly evaluating the model using appropriate metrics ensures its effectiveness.

## References

- [Scikit-learn Documentation](https://scikit-learn.org/stable/modules/linear_model.html#polynomial-regression)
- [Wikipedia: Polynomial Regression](https://en.wikipedia.org/wiki/Polynomial_reg)

diff --git a/contrib/machine-learning/pytorch-fundamentals.md b/contrib/machine-learning/pytorch-fundamentals.md new file mode 100644 index 00000000..b244ec1f --- /dev/null +++ b/contrib/machine-learning/pytorch-fundamentals.md @@ -0,0 +1,469 @@
# PyTorch Fundamentals

```python
# Import pytorch in our codespace
import torch
print(torch.__version__)
```

#### Output
```
2.3.0+cu121
```

Here, 2.3.0 is the PyTorch version and cu121 means the build was compiled against CUDA 12.1.

You have already seen how to create a tensor in PyTorch. In this notebook I am going to show you the operations that can be applied to a tensor, starting with a quick revision.

### 1. Creating tensors

Scalar tensor (a zero-dimension tensor)

```python
scalar = torch.tensor(7)
print(scalar)
```

#### Output
```
tensor(7)
```

Check the dimension of the above tensor

```python
print(scalar.ndim)
```

#### Output
```
0
```

To retrieve the number from the tensor we use `item()`

```python
print(scalar.item())
```

#### Output
```
7
```

Vector (a single-dimension tensor that can contain many numbers)

```python
vector = torch.tensor([1,2])
print(vector)
```

#### Output
```
tensor([1, 2])
```

Check the dimensions

```python
print(vector.ndim)
```

#### Output
```
1
```

Check the shape of the vector

```python
print(vector.shape)
```

#### Output
```
torch.Size([2])
```

The above returns torch.Size([2]), which means our vector has a shape of [2]. This is because of the two elements we placed inside the square brackets ([1,2]).

Note:
I'll let you in on a trick.
+

You can tell the number of dimensions a tensor in PyTorch has by the number of square brackets on the outside ([), and you only need to count one side.

```python
# Let's create a matrix
MATRIX = torch.tensor([[1,2],
                       [4,5]])
print(MATRIX)
```

#### Output
```
tensor([[1, 2],
        [4, 5]])
```

There are two brackets, so it must have 2 dimensions. Let's check:

```python
print(MATRIX.ndim)
```

#### Output
```
2
```

```python
# Shape
print(MATRIX.shape)
```

#### Output
```
torch.Size([2, 2])
```

It means MATRIX has 2 rows and 2 columns.

Let's create a TENSOR

```python
TENSOR = torch.tensor([[[1,2,3],
                        [4,5,6],
                        [7,8,9]]])
print(TENSOR)
```

#### Output
```
tensor([[[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]])
```

Let's check the dimensions
```python
print(TENSOR.ndim)
```

#### Output
```
3
```

Shape
```python
print(TENSOR.shape)
```

#### Output
```
torch.Size([1, 3, 3])
```

The dimensions go outer to inner.

That means there's 1 dimension of 3 by 3.

##### Let's summarise

* scalar -> a single number having 0 dimensions.
* vector -> an array of numbers having 1 dimension.
* matrix -> an array of numbers having 2 dimensions.
* tensor -> an array of numbers having n dimensions.

### Random Tensors

We can create them using `torch.rand()` and passing in the `size` parameter.

Creating a random tensor of size (3,4)
```python
rand_tensor = torch.rand(size = (3,4))
print(rand_tensor)
```

#### Output
```
tensor([[0.7462, 0.4950, 0.7851, 0.8277],
        [0.6112, 0.5159, 0.1728, 0.6847],
        [0.4472, 0.1612, 0.6481, 0.3236]])
```

Check the dimensions

```python
print(rand_tensor.ndim)
```

#### Output
```
2
```

Shape
```python
print(rand_tensor.shape)
```

#### Output
```
torch.Size([3, 4])
```

Datatype
```python
print(rand_tensor.dtype)
```

#### Output
```
torch.float32
```

### Zeros and ones

Here we will create a tensor of any shape filled with zeros or ones

```python
# Create a tensor of all zeros
zeros = torch.zeros(size = (3,4))
print(zeros)
```

#### Output
```
tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
```

Create a tensor of ones
```python
ones = torch.ones(size = (3,4))
print(ones)
```

#### Output
```
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])
```

### Create a tensor having a range of numbers

You can use `torch.arange(start, end, step)` to do so.

Where:

* start = start of range (e.g. 0)
* end = end of range (e.g. 10)
* step = how many steps in between each value (e.g. 1)

> Note: In Python, you can use range() to create a range. However, in PyTorch, torch.range() is deprecated and may show an error, so use `torch.arange()` instead.

```python
zero_to_ten = torch.arange(start = 0,
                           end = 10,
                           step = 1)
print(zero_to_ten)
```

#### Output
```
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
```

# 2. Manipulating tensors (tensor operations)

The operations are:

* Addition
* Subtraction
* Multiplication (element-wise)
* Division
* Matrix multiplication

### 1. Addition

```python
tensor = torch.tensor([1,2,3])
print(tensor+10)
```

#### Output
```
tensor([11, 12, 13])
```

We have added 10 to each element of the tensor.

```python
tensor1 = torch.tensor([4,5,6])
print(tensor+tensor1)
```

#### Output
```
tensor([5, 7, 9])
```

We have added two tensors; remember that addition takes place element-wise.

### 2. Subtraction

```python
print(tensor-8)
```

#### Output
```
tensor([-7, -6, -5])
```

We've subtracted 8 from the above tensor.

```python
print(tensor-tensor1)
```

#### Output
```
tensor([-3, -3, -3])
```

### 3. Multiplication

```python
# Multiply the tensor by 10 (element-wise)
print(tensor*10)
```

#### Output
```
tensor([10, 20, 30])
```

Each element of the tensor gets multiplied by 10.

Note:

PyTorch also has a bunch of built-in functions like `torch.mul()` (short for multiplication) and `torch.add()` to perform basic operations.

```python
# let's see them
print(torch.add(tensor,10))
```

#### Output
```
tensor([11, 12, 13])
```

```python
print(torch.mul(tensor,10))
```

#### Output
```
tensor([10, 20, 30])
```

### Matrix multiplication (is all you need)
One of the most common operations in machine learning and deep learning algorithms (like neural networks) is matrix multiplication.

PyTorch implements matrix multiplication functionality in the `torch.matmul()` method.

The two main rules for matrix multiplication to remember are:

The inner dimensions must match:
* (3, 2) @ (3, 2) won't work
* (2, 3) @ (3, 2) will work
* (3, 2) @ (2, 3) will work

The resulting matrix has the shape of the outer dimensions:
* (2, 3) @ (3, 2) -> (2, 2)
* (3, 2) @ (2, 3) -> (3, 3)

Note: "@" in Python is the symbol for matrix multiplication.

```python
# let's perform the matrix multiplication
tensor1 = torch.tensor([[[1,2,3],
                         [4,5,6],
                         [7,8,9]]])
tensor2 = torch.tensor([[[1,1,1],
                         [2,2,2],
                         [3,3,3]]])

print(tensor1)
print(tensor2)
```

#### Output
```
tensor([[[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]])
tensor([[[1, 1, 1],
         [2, 2, 2],
         [3, 3, 3]]])
```

Let's check the shapes
```python
print(tensor1.shape)
print(tensor2.shape)
```

#### Output
```
torch.Size([1, 3, 3])
torch.Size([1, 3, 3])
```

Matrix multiplication
```python
print(torch.matmul(tensor1, tensor2))
```

#### Output
```
tensor([[[14, 14, 14],
         [32, 32, 32],
         [50, 50, 50]]])
```

You can also use the "@" symbol for matrix multiplication, though it is not recommended
```python
print(tensor1 @ tensor2)
```

#### Output
```
tensor([[[14, 14, 14],
         [32, 32, 32],
         [50, 50, 50]]])
```

Note:

If the shapes are not compatible, you can transpose one of the tensors and then perform the matrix multiplication.

diff --git a/contrib/machine-learning/reinforcement-learning.md b/contrib/machine-learning/reinforcement-learning.md new file mode 100644 index 00000000..c5529fc9 --- /dev/null +++ b/contrib/machine-learning/reinforcement-learning.md @@ -0,0 +1,233 @@
# Reinforcement Learning: A Comprehensive Guide

Reinforcement Learning (RL) is a field of Machine Learning which focuses on goal-directed learning from interaction with the environment. In RL, an agent learns to make decisions by performing actions in an environment to maximize a cumulative numerical reward signal. This README aims to provide a thorough understanding of RL, covering key concepts, algorithms, applications, and resources.

## What is Reinforcement Learning?

Reinforcement learning involves determining the best actions to take in various situations to maximize a numerical reward signal. Instead of being instructed on which actions to take, the learner must explore and identify the actions that lead to the highest rewards through trial and error.
After each action is performed in its environment, a trainer may give feedback in the form of rewards or penalties to indicate the desirability of the resulting state. Unlike supervised learning, reinforcement learning does not depend on labeled data but instead learns from the outcomes of its actions.

## Key Concepts and Terminology

### Agent
An agent is a system or entity that learns to make decisions by interacting with an environment. The agent improves its performance by trial and error, receiving feedback from the environment in the form of rewards or punishments.

### Environment
The environment is the setting or world in which the agent operates and interacts. It provides the agent with states and feedback based on the agent's actions.

### State
A state represents the current situation of the environment, encapsulating all the relevant information needed for decision-making.

### Action
An action represents a move that can be taken by the agent, which affects the state of the environment. The set of all possible actions is called the action space.

### Reward
The reward is the feedback from the environment in response to the agent's action, thereby defining which actions are good and which are bad. The agent aims to maximize the total reward over time.

### Policy
A policy is a strategy used by the agent to determine its actions based on the current state. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process.

### Value Function
The value function of a state is the expected total amount of reward an agent can expect to accumulate over the future, starting from that state. There are two main types of value functions:
 - **State Value Function (V)**: The expected reward starting from a state and following a certain policy thereafter.
 - **Action Value Function (Q)**: The expected reward starting from a state, taking a specific action, and following a certain policy thereafter.

### Model
A model mimics the behavior of the environment or, more generally, allows inferences to be made about how the environment will behave.

### Exploration vs. Exploitation
To accumulate substantial rewards, a reinforcement learning agent needs to favor actions that have previously yielded high rewards. However, to identify these effective actions, the agent must also attempt actions it hasn't tried before. This means the agent must *exploit* its past experiences to gain rewards, while also *exploring* new actions to improve its future decision-making.

## Types of Reinforcement Learning

### Model-Based vs Model-Free

**Model-Based Reinforcement Learning:** Model-based methods involve creating a model of the environment to predict future states and rewards, allowing the agent to plan its actions by simulating various scenarios. These methods often involve two main components: a learned model of the environment's dynamics and a planning procedure that uses this model to choose actions.

**Model-Free Reinforcement Learning:** Model-free methods do not explicitly learn a model of the environment. Instead, they learn a policy or value function directly from the interactions with the environment. These methods can be further divided into two categories: value-based and policy-based methods.

### Value-Based Methods:
Value-based methods focus on estimating the value function, and the policy is indirectly derived from the value function.

### Policy-Based Methods:
Policy-based methods directly optimize the policy by maximizing the expected cumulative reward to find the optimal policy parameters. A short illustrative sketch of the two approaches follows.
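
The snippet below is a toy illustration of this distinction (the numbers are made up, not a real algorithm): a value-based agent acts greedily with respect to learned values, while a policy-based agent samples actions from an explicitly parameterized policy.

```python
import numpy as np

# Value-based: the policy is implicit -- act greedily w.r.t. learned Q-values
Q_values = np.array([0.1, 0.5, 0.2])           # hypothetical Q-values for 3 actions in one state
value_based_action = int(np.argmax(Q_values))  # always picks action 1 here

# Policy-based: the policy is explicit -- sample from learned action probabilities
theta = np.array([0.2, 1.0, 0.4])              # hypothetical policy parameters (logits)
probs = np.exp(theta) / np.sum(np.exp(theta))  # softmax turns logits into probabilities
policy_based_action = int(np.random.choice(len(probs), p=probs))

print(value_based_action, policy_based_action)
```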
+

### Actor-Critic Methods:
Actor-Critic methods combine the strengths of both value-based and policy-based methods. The actor learns the policy that maps states to actions, and the critic learns the value function that evaluates the actions chosen by the actor.

## Important Algorithms

### Q-Learning
Q-Learning is a model-free algorithm used in reinforcement learning to learn the value of an action in a particular state. It aims to find the optimal policy by iteratively updating the Q-values, which represent the expected cumulative reward of taking a particular action in a given state and following the optimal policy thereafter.

#### Algorithm:
1. Initialize Q-values arbitrarily for all state-action pairs.
2. Repeat for each episode:
   - Choose an action using an exploration strategy (e.g., epsilon-greedy).
   - Take the action, observe the reward and the next state.
   - Update the Q-value of the current state-action pair using the Bellman equation:
     $$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right)$$

     where:
     - $Q(s, a)$ is the Q-value of state $s$ and action $a$.
     - $r$ is the observed reward.
     - $s'$ is the next state.
     - $\alpha$ is the learning rate.
     - $\gamma$ is the discount factor.
3. Until convergence or a maximum number of episodes.

### SARSA
SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference algorithm used for learning the Q-function. Unlike Q-learning, SARSA directly updates the Q-values based on the current policy.

#### Algorithm:
1. Initialize Q-values arbitrarily for all state-action pairs.
2. Repeat for each episode:
   - Initialize the environment state $s$.
   - Choose an action $a$ using the current policy (e.g., epsilon-greedy).
   - Repeat for each timestep:
     - Take action $a$, observe the reward $r$ and the next state $s'$.
     - Choose the next action $a'$ using the current policy.
     - Update the Q-value of the current state-action pair using the SARSA update rule:
       $$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma Q(s', a') - Q(s, a) \right)$$
3. Until convergence or a maximum number of episodes.

### REINFORCE Algorithm:
REINFORCE (Monte Carlo policy gradient) is a simple policy gradient method that updates the policy parameters in the direction of the gradient of expected rewards.

### Proximal Policy Optimization (PPO):
PPO is an advanced policy gradient method that improves stability by limiting the policy updates within a certain trust region.

### A2C/A3C:
Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C) are variants of actor-critic methods that utilize multiple parallel agents to improve sample efficiency.

## Mathematical Background

### Markov Decision Processes (MDPs)
A Markov Decision Process (MDP) is a mathematical framework used to model decision-making problems. It consists of states, actions, rewards, and transition probabilities.

### Bellman Equations
Bellman equations are fundamental recursive equations in dynamic programming and reinforcement learning. They express the value of a decision at one point in time in terms of the expected value of the subsequent decisions.

## Applications of Reinforcement Learning

### Gaming
Reinforcement learning is extensively used in gaming for developing AI agents capable of playing complex games such as Go (most famously via AlphaGo), Chess, and video games.
RL algorithms enable these agents to learn optimal strategies by interacting with the game environment and receiving feedback in the form of rewards. + +### Robotics +In robotics, reinforcement learning is employed to teach robots various tasks such as navigation, manipulation, and control. RL algorithms allow robots to learn from their interactions with the environment, enabling them to adapt and improve their behavior over time without explicit programming. + +### Finance +Reinforcement learning plays a crucial role in finance, particularly in algorithmic trading and portfolio management. RL algorithms are utilized to optimize trading strategies, automate decision-making processes, and manage investment portfolios dynamically based on changing market conditions and objectives. + +### Healthcare +In healthcare, reinforcement learning is utilized for various applications such as personalized treatment, drug discovery, and optimizing healthcare operations. RL algorithms can assist in developing personalized treatment plans for patients, identifying effective drug candidates, and optimizing resource allocation in hospitals to improve patient care and outcomes. + +## Tools and Libraries +- **OpenAI Gym:** A toolkit for developing and comparing RL algorithms. +- **TensorFlow/TF-Agents:** A library for RL in TensorFlow. +- **PyTorch:** Popular machine learning library with RL capabilities. +- **Stable Baselines3:** A set of reliable implementations of RL algorithms in PyTorch. + +## How to Start with Reinforcement Learning + +### Prerequisites +- Basic knowledge of machine learning and neural networks. +- Proficiency in Python. + +### Beginner Project +The provided Python code implements the Q-learning algorithm for a basic grid world environment. It defines the grid world, actions, and parameters such as discount factor and learning rate. The algorithm iteratively learns the optimal action-value function (Q-values) by updating them based on rewards obtained from actions taken in each state. Finally, the learned Q-values are printed for each state-action pair. + +```python +import numpy as np + +# Define the grid world environment +# 'S' represents the start state +# 'G' represents the goal state +# 'H' represents the hole (negative reward) +# '.' 
represents empty cells (neutral reward)
# 'W' represents walls (impassable)
grid_world = np.array([
    ['S', '.', '.', '.', '.'],
    ['.', 'W', '.', 'H', '.'],
    ['.', '.', '.', 'W', '.'],
    ['.', 'W', '.', '.', 'G']
])

# Define the actions (up, down, left, right)
actions = ['UP', 'DOWN', 'LEFT', 'RIGHT']

# Define parameters
gamma = 0.9 # discount factor
alpha = 0.1 # learning rate
epsilon = 0.1 # exploration rate

# Initialize Q-values
num_rows, num_cols = grid_world.shape
num_actions = len(actions)
Q = np.zeros((num_rows, num_cols, num_actions))

# Define helper function to get possible actions in a state
def possible_actions(state):
    row, col = state
    possible_actions = []
    for i, action in enumerate(actions):
        if action == 'UP' and row > 0 and grid_world[row - 1, col] != 'W':
            possible_actions.append(i)
        elif action == 'DOWN' and row < num_rows - 1 and grid_world[row + 1, col] != 'W':
            possible_actions.append(i)
        elif action == 'LEFT' and col > 0 and grid_world[row, col - 1] != 'W':
            possible_actions.append(i)
        elif action == 'RIGHT' and col < num_cols - 1 and grid_world[row, col + 1] != 'W':
            possible_actions.append(i)
    return possible_actions

# Q-learning algorithm
num_episodes = 1000
for episode in range(num_episodes):
    # Initialize the starting state
    state = (0, 0) # start state
    while True:
        # Choose an action using an epsilon-greedy policy,
        # restricted to the valid moves in this state so the agent
        # never steps into a wall or off the grid
        valid = possible_actions(state)
        if np.random.uniform(0, 1) < epsilon:
            action = np.random.choice(valid)
        else:
            action = valid[int(np.argmax(Q[state[0], state[1], valid]))]

        # Perform the action and observe the next state and reward
        if actions[action] == 'UP':
            next_state = (state[0] - 1, state[1])
        elif actions[action] == 'DOWN':
            next_state = (state[0] + 1, state[1])
        elif actions[action] == 'LEFT':
            next_state = (state[0], state[1] - 1)
        elif actions[action] == 'RIGHT':
            next_state = (state[0], state[1] + 1)

        # Get the reward
        if grid_world[next_state] == 'G':
            reward = 1 # goal state
        elif grid_world[next_state] == 'H':
            reward = -1 # hole
        else:
            reward = 0

        # Update Q-value using the Bellman equation
        best_next_action = np.argmax(Q[next_state[0], next_state[1]])
        Q[state[0], state[1], action] += alpha * (
            reward + gamma * Q[next_state[0], next_state[1], best_next_action] - Q[state[0], state[1], action])

        # Move to the next state
        state = next_state

        # Check if the episode is terminated
        if grid_world[state] in ['G', 'H']:
            break

# Print the learned Q-values
print("Learned Q-values:")
for i in range(num_rows):
    for j in range(num_cols):
        print(f"State ({i}, {j}):", Q[i, j])
```

## Conclusion
Congratulations on completing your journey through this comprehensive guide to reinforcement learning! Armed with this knowledge, you are well-equipped to dive deeper into the exciting world of RL, whether it's for gaming, robotics, finance, healthcare, or any other domain. Keep exploring, experimenting, and learning, and remember, the only limit to what you can achieve with reinforcement learning is your imagination.

diff --git a/contrib/machine-learning/xgboost.md b/contrib/machine-learning/xgboost.md new file mode 100644 index 00000000..1eb7f09a --- /dev/null +++ b/contrib/machine-learning/xgboost.md @@ -0,0 +1,92 @@
# XGBoost
XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
+

## Introduction to Gradient Boosting
Gradient boosting is a powerful technique for building predictive models that has seen widespread success in various applications.
- **Boosting Concept**: Boosting originated from the idea of modifying weak learners to improve their predictive capability.
- **AdaBoost**: The first successful boosting algorithm was Adaptive Boosting (AdaBoost), which utilizes decision stumps as weak learners.
- **Gradient Boosting Machines (GBM)**: AdaBoost and related algorithms were later reformulated as Gradient Boosting Machines, casting boosting as a numerical optimization problem.
- **Algorithm Elements**:
  - _Loss function_: Determines the objective to minimize (e.g., cross-entropy for classification, mean squared error for regression).
  - _Weak learner_: Typically, decision trees are used as weak learners.
  - _Additive model_: New weak learners are added iteratively to minimize the loss function, correcting the errors of previous models.

## Introduction to XGBoost
- eXtreme Gradient Boosting (XGBoost): a more **regularized form** of Gradient Boosting, as it uses **advanced regularization (L1 & L2)**, improving the model's **generalization capabilities**.
- It's suitable when there is **a large number of training samples and a small number of features**, or when there is **a mixture of categorical and numerical features**.
- **Development**: Created by Tianqi Chen, XGBoost is designed for computational speed and model performance.
- **Key Features**:
  - _Speed_: Achieved through careful engineering, including parallelization of tree construction, distributed computing, and cache optimization.
  - _Support for Variations_: XGBoost supports various techniques and optimizations.
  - _Out-of-Core Computing_: Can handle very large datasets that don't fit into memory.
- **Advantages**:
  - _Sparse Optimization_: Suitable for datasets with many zero values.
  - _Regularization_: Implements advanced regularization techniques (L1 and L2), enhancing generalization capabilities.
  - _Parallel Training_: Utilizes all CPU cores during training for faster processing.
  - _Multiple Loss Functions_: Supports different loss functions based on the problem type.
  - _Bagging and Early Stopping_: Additional techniques for improving performance and efficiency.
- **Pre-Sorted Decision Tree Algorithm**:
  1. Features are pre-sorted by their values.
  2. Traversing segmentation points involves finding the best split point on a feature with a cost of O(#data).
  3. Data is split into left and right child nodes after finding the split point.
  4. Pre-sorting allows for accurate split point determination.
  - **Limitations**:
    1. Iterative Traversal: Each iteration requires traversing the entire training data multiple times.
    2. Memory Consumption: Loading the entire training data into memory limits its size, while not loading it leads to time-consuming read/write operations.
    3. Space Consumption: Pre-sorting consumes space, storing feature sorting results and split gain calculations.

  XGBoost:
  ![image](assets/XG_1.webp)

## Develop Your First XGBoost Model
This code uses the XGBoost library to train a model on the Iris dataset: it splits the data, sets hyperparameters, trains the model, makes predictions, and evaluates accuracy, achieving an accuracy score of 1.0 on the testing set.
+ +```python +# XGBoost with Iris Dataset +# Importing necessary libraries +import numpy as np +import xgboost as xgb +from sklearn.datasets import load_iris +from sklearn.model_selection import train_test_split +from sklearn.metrics import accuracy_score + +# Loading a sample dataset (Iris dataset) +data = load_iris() +X = data.data +y = data.target + +# Splitting the dataset into training and testing sets +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) + +# Converting the dataset into DMatrix format +dtrain = xgb.DMatrix(X_train, label=y_train) +dtest = xgb.DMatrix(X_test, label=y_test) + +# Setting hyperparameters for XGBoost +params = { + 'max_depth': 3, + 'eta': 0.1, + 'objective': 'multi:softmax', + 'num_class': 3 +} + +# Training the XGBoost model +num_round = 50 +model = xgb.train(params, dtrain, num_round) + +# Making predictions on the testing set +y_pred = model.predict(dtest) + +# Evaluating the model +accuracy = accuracy_score(y_test, y_pred) +print("Accuracy:", accuracy) +``` + +### Output + + Accuracy: 1.0 + +## **Conclusion** +XGBoost's focus on speed, performance, and scalability has made it one of the most widely used and powerful predictive modeling algorithms available. Its ability to handle large datasets efficiently, along with its advanced features and optimizations, makes it a valuable tool in machine learning and data science. + +## Reference +- [Machine Learning Prediction of Turning Precision Using Optimized XGBoost Model](https://www.mdpi.com/2076-3417/12/15/7739) diff --git a/contrib/numpy/index.md b/contrib/numpy/index.md index 50e80460..a7c1161b 100644 --- a/contrib/numpy/index.md +++ b/contrib/numpy/index.md @@ -11,3 +11,6 @@ - [Sorting NumPy Arrays](sorting-array.md) - [NumPy Array Iteration](array-iteration.md) - [Concatenation of Arrays](concatenation-of-arrays.md) +- [Splitting of Arrays](splitting-arrays.md) +- [Universal Functions (Ufunc)](universal-functions.md) +- [Statistical Functions on Arrays](statistical-functions.md) diff --git a/contrib/numpy/splitting-arrays.md b/contrib/numpy/splitting-arrays.md new file mode 100644 index 00000000..3228cb71 --- /dev/null +++ b/contrib/numpy/splitting-arrays.md @@ -0,0 +1,135 @@ +# Splitting Arrays + +Splitting a NumPy array refers to dividing the array into smaller sub-arrays. This can be done in various ways, along specific rows, columns, or even based on conditions applied to the elements. + +There are several ways to split a NumPy array in Python using different functions. Some of these methods include: + +- Splitting a NumPy array using `numpy.split()` +- Splitting a NumPy array using `numpy.array_split()` +- Splitting a NumPy array using `numpy.vsplit()` +- Splitting a NumPy array using `numpy.hsplit()` +- Splitting a NumPy array using `numpy.dsplit()` + +## NumPy split() + +The `numpy.split()` function divides an array into equal parts along a specified axis. + +**Code** +```python +import numpy as np +array = np.array([1,2,3,4,5,6]) +#Splitting the array into 3 equal parts along axis=0 +result = np.split(array,3) +print(result) +``` + +**Output** +``` +[array([1, 2]), array([3, 4]), array([5, 6])] +``` + +## NumPy array_split() + +The `numpy.array_split()` function divides an array into equal or nearly equal sub-arrays. Unlike `numpy.split()`, it allows for uneven splitting, making it useful when the array cannot be evenly divided by the specified number of splits. 
+

**Code**
```python
import numpy as np
array = np.array([1,2,3,4,5,6,7,8])
# Splitting the array into 3 nearly equal parts along axis=0
result = np.array_split(array,3)
print(result)
```

**Output**
```
[array([1, 2, 3]), array([4, 5, 6]), array([7, 8])]
```

## NumPy vsplit()

The `numpy.vsplit()` function performs vertical splitting (row-wise), dividing an array along the vertical axis (axis=0).

**Code**
```python
import numpy as np
array = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9],
                  [10, 11, 12]])
# Vertically splitting the array into 2 subarrays along axis=0
result = np.vsplit(array,2)
print(result)
```

**Output**
```
[array([[1, 2, 3],
       [4, 5, 6]]), array([[ 7,  8,  9],
       [10, 11, 12]])]
```

## NumPy hsplit()

The `numpy.hsplit()` function performs horizontal splitting (column-wise), dividing an array along the horizontal axis (axis=1).

**Code**
```python
import numpy as np
array = np.array([[1, 2, 3, 4],
                  [5, 7, 8, 9],
                  [11,12,13,14]])
# Horizontally splitting the array into 4 subarrays along axis=1
result = np.hsplit(array,4)
print(result)
```

**Output**
```
[array([[ 1],
       [ 5],
       [11]]), array([[ 2],
       [ 7],
       [12]]), array([[ 3],
       [ 8],
       [13]]), array([[ 4],
       [ 9],
       [14]])]
```

## NumPy dsplit()

The `numpy.dsplit()` function is employed for splitting arrays along the third axis (axis=2), which is applicable for 3D arrays and beyond.

**Code**
```python
import numpy as np
# 3D array
array = np.array([[[ 1,  2,  3,  4],
                   [ 5,  6,  7,  8],
                   [ 9, 10, 11, 12]],
                  [[13, 14, 15, 16],
                   [17, 18, 19, 20],
                   [21, 22, 23, 24]]])
# Splitting the array along axis=2
result = np.dsplit(array,2)
print(result)
```

**Output**
```
[array([[[ 1,  2],
        [ 5,  6],
        [ 9, 10]],

       [[13, 14],
        [17, 18],
        [21, 22]]]), array([[[ 3,  4],
        [ 7,  8],
        [11, 12]],

       [[15, 16],
        [19, 20],
        [23, 24]]])]
```

diff --git a/contrib/numpy/statistical-functions.md b/contrib/numpy/statistical-functions.md new file mode 100644 index 00000000..06fbae22 --- /dev/null +++ b/contrib/numpy/statistical-functions.md @@ -0,0 +1,154 @@
# Statistical Operations on Arrays

Statistics involves collecting data, analyzing it, and drawing conclusions from the gathered information.

NumPy provides powerful statistical functions to perform efficient data analysis on arrays, including `minimum`, `maximum`, `mean`, `median`, `variance`, `standard deviation`, and more.

## Minimum

In NumPy, the minimum value of an array is the smallest element present.

The smallest element of an array is calculated using the `np.min()` function.

**Code**
```python
import numpy as np
array = np.array([100,20,300,400])
# Calculating the minimum
result = np.min(array)
print("Minimum :", result)
```

**Output**
```
Minimum : 20
```

## Maximum

In NumPy, the maximum value of an array is the largest element present.

The largest element of an array is calculated using the `np.max()` function.

**Code**
```python
import numpy as np
array = np.array([100,20,300,400])
# Calculating the maximum
result = np.max(array)
print("Maximum :", result)
```

**Output**
```
Maximum : 400
```

## Mean

The mean value of a NumPy array is the average of all its elements.

It is calculated by summing all the elements and then dividing by the total number of elements.

The mean of an array is calculated using the `np.mean()` function.
+

**Code**
```python
import numpy as np
array = np.array([10,20,30,40])
# Calculating the mean
result = np.mean(array)
print("Mean :", result)
```

**Output**
```
Mean : 25.0
```

## Median

The median value of a NumPy array is the middle value in a sorted array.

It separates the higher half of the data from the lower half.

The median of an array is calculated using the `np.median()` function.

It is important to note that:

- If the number of elements is `odd`, the median is the middle element.
- If the number of elements is `even`, the median is the average of the two middle elements.

**Code**
```python
import numpy as np
# The number of elements is odd
array = np.array([5,6,7,8,9])
# Calculating the median
result = np.median(array)
print("Median :", result)
```

**Output**
```
Median : 7.0
```

**Code**
```python
import numpy as np
# The number of elements is even
array = np.array([1,2,3,4,5,6])
# Calculating the median
result = np.median(array)
print("Median :", result)
```

**Output**
```
Median : 3.5
```

## Variance

Variance in a NumPy array measures the spread or dispersion of the data points.

It is calculated as the average of the squared differences from the mean.

The variance of an array is calculated using the `np.var()` function.

**Code**
```python
import numpy as np
array = np.array([10,70,80,50,30])
# Calculating the variance
result = np.var(array)
print("Variance :", result)
```

**Output**
```
Variance : 656.0
```

## Standard Deviation

The standard deviation of a NumPy array measures the amount of variation or dispersion of the elements in the array.

It is calculated as the square root of the average of the squared differences from the mean, providing insight into how spread out the values are around the mean.

The standard deviation of an array is calculated using the `np.std()` function.

**Code**
```python
import numpy as np
array = np.array([25,30,40,55,75,100])
# Calculating the standard deviation
result = np.std(array)
print("Standard Deviation :", result)
```

**Output**
```
Standard Deviation : 26.365486699260625
```

diff --git a/contrib/numpy/universal-functions.md b/contrib/numpy/universal-functions.md new file mode 100644 index 00000000..090f33c5 --- /dev/null +++ b/contrib/numpy/universal-functions.md @@ -0,0 +1,130 @@
# Universal functions (ufunc)

---

A `ufunc`, short for "`universal function`," is a fundamental concept in NumPy, a powerful library for numerical computing in Python. Universal functions are highly optimized, element-wise functions designed to perform operations on data stored in NumPy arrays.

## Uses of Ufuncs in NumPy

Universal functions (ufuncs) in NumPy provide a wide range of functionalities for efficient and powerful numerical computations. Below is a detailed explanation of their uses:

### 1. **Element-wise Operations**
Ufuncs perform operations on each element of the arrays independently.

```python
import numpy as np

A = np.array([1, 2, 3, 4])
B = np.array([5, 6, 7, 8])

# Element-wise addition
np.add(A, B)  # Output: array([ 6,  8, 10, 12])
```

### 2. **Broadcasting**
Ufuncs support broadcasting, allowing operations on arrays with different shapes, making it possible to perform operations without explicitly reshaping arrays.

```python
C = np.array([1, 2, 3])
D = np.array([[1], [2], [3]])

# Broadcasting addition
np.add(C, D)  # Output: array([[2, 3, 4], [3, 4, 5], [4, 5, 6]])
```

### 3. **Vectorization**
Ufuncs are vectorized, meaning they are implemented in low-level C code, allowing for fast execution and avoiding the overhead of Python loops.

```python
# Vectorized square root
np.sqrt(A)  # Output: array([1., 1.41421356, 1.73205081, 2.])
```

### 4. **Type Flexibility**
Ufuncs handle various data types and perform automatic type casting as needed.

```python
E = np.array([1.0, 2.0, 3.0])
F = np.array([4, 5, 6])

# Addition with type casting
np.add(E, F)  # Output: array([5., 7., 9.])
```

### 5. **Reduction Operations**
Ufuncs support reduction operations, such as summing all elements of an array or finding the product of all elements.

```python
# Summing all elements
np.add.reduce(A)  # Output: 10

# Product of all elements
np.multiply.reduce(A)  # Output: 24
```

### 6. **Accumulation Operations**
Ufuncs can perform accumulation operations, which keep a running tally of the computation.

```python
# Cumulative sum
np.add.accumulate(A)  # Output: array([ 1,  3,  6, 10])
```

### 7. **Reduceat Operations**
Ufuncs can perform segmented reductions using the `reduceat` method, which applies the ufunc at specified intervals.

```python
G = np.array([0, 1, 2, 3, 4, 5, 6, 7])
indices = [0, 2, 5]
np.add.reduceat(G, indices)  # Output: array([ 1,  9, 18])
```

### 8. **Outer Product**
Ufuncs can compute the outer product of two arrays, producing a matrix where each element is the result of applying the ufunc to each pair of elements from the input arrays.

```python
# Outer product
np.multiply.outer([1, 2, 3], [4, 5, 6])
# Output: array([[ 4,  5,  6],
#                [ 8, 10, 12],
#                [12, 15, 18]])
```

### 9. **Out Parameter**
Ufuncs can use the `out` parameter to store results in a pre-allocated array, saving memory and improving performance.

```python
result = np.empty_like(A)
np.multiply(A, B, out=result)  # Output: array([ 5, 12, 21, 32])
```

# Create Your Own Ufunc

You can create custom ufuncs for specific needs using np.frompyfunc or np.vectorize, allowing Python functions to behave like ufuncs.

Here, we are using `frompyfunc()`, which takes three arguments:

1. function - the name of the function.
2. inputs - the number of input arrays.
3. outputs - the number of output arrays.
+

```python
# my_add is a plain Python function; np.frompyfunc turns it into a ufunc
# (A and B are the arrays defined earlier)
def my_add(x, y):
    return x + y

my_add_ufunc = np.frompyfunc(my_add, 2, 1)
my_add_ufunc(A, B)  # Output: array([6, 8, 10, 12], dtype=object)
```
# Some Common Ufuncs

Here are some commonly used ufuncs in NumPy:

- **Arithmetic**: `np.add`, `np.subtract`, `np.multiply`, `np.divide`
- **Trigonometric**: `np.sin`, `np.cos`, `np.tan`
- **Exponential and Logarithmic**: `np.exp`, `np.log`, `np.log10`
- **Comparison**: `np.maximum`, `np.minimum`, `np.greater`, `np.less`
- **Logical**: `np.logical_and`, `np.logical_or`, `np.logical_not`

For more such ufuncs, refer to [Universal functions (ufunc) — NumPy](https://numpy.org/doc/stable/reference/ufuncs.html)

diff --git a/contrib/pandas/index.md b/contrib/pandas/index.md index e5a83533..db008e2f 100644 --- a/contrib/pandas/index.md +++ b/contrib/pandas/index.md @@ -9,3 +9,4 @@ - [Working with Date & Time in Pandas](datetime.md) - [Importing and Exporting Data in Pandas](import-export.md) - [Handling Missing Values in Pandas](handling-missing-values.md) +- [Pandas Series](pandas-series.md)
diff --git a/contrib/pandas/pandas-series.md b/contrib/pandas/pandas-series.md new file mode 100644 index 00000000..88b12351 --- /dev/null +++ b/contrib/pandas/pandas-series.md @@ -0,0 +1,317 @@
# Pandas Series

A Series is a Pandas data structure that represents a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

## Creating a Series object:

### Basic Series
To create a basic Series, you can pass a list or array of data to the `pd.Series()` function.

```python
import pandas as pd

s1 = pd.Series([4, 5, 2, 3])
print(s1)
```

#### Output
```
0    4
1    5
2    2
3    3
dtype: int64
```

### Series from a Dictionary

If you pass a dictionary to `pd.Series()`, the keys become the index and the values become the data of the Series.
```python
import pandas as pd

s2 = pd.Series({'A': 1, 'B': 2, 'C': 3})
print(s2)
```

#### Output
```
A    1
B    2
C    3
dtype: int64
```

## Additional Functionality

### Specifying Data Type and Index
You can specify the data type and index while creating a Series.
```python
import pandas as pd

s4 = pd.Series([1, 2, 3], index=['a', 'b', 'c'], dtype='float64')
print(s4)
```

#### Output
```
a    1.0
b    2.0
c    3.0
dtype: float64
```

### Specifying NaN Values:
* Sometimes you need to create a Series object of a certain size, but you do not have complete data available. In such cases, you can fill the missing data with a NaN (Not a Number) value.
* When you store a NaN value in a Series object, the data type must be a floating point type. Even if you specify an integer type, pandas will promote it to a floating point type automatically, because NaN is not supported by integer types.

```python
import numpy as np
import pandas as pd

s3 = pd.Series([1, np.nan, 2])
print(s3)
```

#### Output
```
0    1.0
1    NaN
2    2.0
dtype: float64
```

### Creating Data from Expressions
You can create a Series using an expression or function.
The general form is `pd.Series(data=<expression>, index=<sequence>)`.

```python
import numpy as np
import pandas as pd

a = np.arange(1, 5)  # [1, 2, 3, 4]
s5 = pd.Series(data=a**2, index=a)
print(s5)
```

#### Output
```
1     1
2     4
3     9
4    16
dtype: int64
```

## Series Object Attributes

| **Attribute** | **Description** |
|--------------------------|---------------------------------------------------|
| `.index` | Array of index of the Series |
| `.values` | Array of values of the Series |
| `.dtype` | Return the dtype of the data |
| `.shape` | Return a tuple representing the shape of the data |
| `.ndim` | Return the number of dimensions of the data |
| `.size` | Return the number of elements in the data |
| `.hasnans` | Return True if there is any NaN in the data |
| `.empty` | Return True if the Series object is empty |

- If you use `len()` on a Series object, it returns the total number of elements in the Series, whereas `.count()` returns only the number of non-NaN elements.

## Accessing a Series object and its elements

### Accessing Individual Elements
You can access individual elements using their index; any legal index can be used to access an individual element.
```python
import pandas as pd

s7 = pd.Series(data=[13, 45, 67, 89], index=['A', 'B', 'C', 'D'])
print(s7['A'])
```

#### Output
```
13
```

### Slicing a Series

- Slices are extracted based on their positional index, regardless of the custom index labels.
- Each element in the Series has a positional index starting from 0 (i.e., 0 for the first element, 1 for the second element, and so on).
- `[:]` will return the values of the elements between the start and end positions (excluding the end position).

#### Example

```python
import pandas as pd

s = pd.Series(data=[13, 45, 67, 89], index=['A', 'B', 'C', 'D'])
print(s[:2])
```

#### Output
```
A    13
B    45
dtype: int64
```

This example demonstrates that the first two elements (positions 0 and 1) are returned, regardless of their custom index labels.

## Operations on a Series object

### Modifying elements and indexes
* `<series>[<index>] = <new data value>`
* `<series>[<start> : <end>] = <new data value>`
* `<series>.index = [<new indexes>]`

```python
import pandas as pd

s8 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s8['a'] = 100
s8.index = ['x', 'y', 'z']
print(s8)
```

#### Output
```
x    100
y     20
z     30
dtype: int64
```

**Note: Series objects are value-mutable but size-immutable.**

### Vector operations
We can perform vector operations such as `+`, `-`, `/`, `%`, etc.

#### Addition
```python
import pandas as pd

s9 = pd.Series([1, 2, 3])
print(s9 + 5)
```

#### Output
```
0    6
1    7
2    8
dtype: int64
```

#### Subtraction
```python
print(s9 - 2)
```

#### Output
```
0   -1
1    0
2    1
dtype: int64
```

### Arithmetic on Series objects

#### Addition
```python
import pandas as pd

s10 = pd.Series([1, 2, 3])
s11 = pd.Series([4, 5, 6])
print(s10 + s11)
```

#### Output
```
0    5
1    7
2    9
dtype: int64
```

#### Multiplication

```python
print(s10 * s11)
```

#### Output
```
0     4
1    10
2    18
dtype: int64
```

One thing to keep in mind here is that both Series objects should have the same indexes; otherwise, any index label that is not present in both Series gets a NaN value in the result.
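
A minimal sketch of this index-alignment behaviour (the names and values here are just illustrative):

```python
import pandas as pd

s_a = pd.Series([1, 2], index=['a', 'b'])
s_b = pd.Series([10, 20], index=['b', 'c'])

# Only the shared label 'b' lines up; 'a' and 'c' become NaN
print(s_a + s_b)
# a     NaN
# b    12.0
# c     NaN
# dtype: float64
```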
+

### Head and Tail Functions

| **Functions** | **Description** |
|--------------------------|---------------------------------------------------|
| `.head(n)` | Return the first n elements of the Series |
| `.tail(n)` | Return the last n elements of the Series |

```python
import pandas as pd

s12 = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
print(s12.head(3))
print(s12.tail(3))
```

#### Output
```
0    10
1    20
2    30
dtype: int64
7     80
8     90
9    100
dtype: int64
```

If you don't provide a value for `n`, then by default the results are given for `n=5`.

### A few extra functions

| **Function** | **Description** |
|----------------------------------------|------------------------------------------------------------------------|
| `.sort_values()` | Return the Series object in ascending order based on its values. |
| `.sort_index()` | Return the Series object in ascending order based on its index. |
| `.drop()` | Return the Series with the specified index and its corresponding value removed. |

```python
import pandas as pd

s13 = pd.Series([3, 1, 2], index=['c', 'a', 'b'])
print(s13.sort_values())
print(s13.sort_index())
print(s13.drop('a'))
```

#### Output
```
a    1
b    2
c    3
dtype: int64
a    1
b    2
c    3
dtype: int64
c    3
b    2
dtype: int64
```

## Conclusion
In short, Pandas Series is a fundamental data structure in Python for handling one-dimensional data. It combines an array of values with an index, offering efficient methods for data manipulation and analysis. With its ease of use and powerful functionality, Pandas Series is widely used in data science and analytics for tasks such as data cleaning, exploration, and visualization.
diff --git a/contrib/plotting-visualization/images/Subplots.png b/contrib/plotting-visualization/images/Subplots.png new file mode 100644 index 00000000..9a9ceb57 Binary files /dev/null and b/contrib/plotting-visualization/images/Subplots.png differ
diff --git a/contrib/plotting-visualization/images/plotly-bar-colors-basic.png b/contrib/plotting-visualization/images/plotly-bar-colors-basic.png new file mode 100644 index 00000000..9b940133 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-bar-colors-basic.png differ
diff --git a/contrib/plotting-visualization/images/plotly-bar-colors.png b/contrib/plotting-visualization/images/plotly-bar-colors.png new file mode 100644 index 00000000..d7e20a73 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-bar-colors.png differ
diff --git a/contrib/plotting-visualization/images/plotly-bar-labels-1.png b/contrib/plotting-visualization/images/plotly-bar-labels-1.png new file mode 100644 index 00000000..2c4d9f5d Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-bar-labels-1.png differ
diff --git a/contrib/plotting-visualization/images/plotly-bar-labels-2.png b/contrib/plotting-visualization/images/plotly-bar-labels-2.png new file mode 100644 index 00000000..05fcda6f Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-bar-labels-2.png differ
diff --git a/contrib/plotting-visualization/images/plotly-bar-labels-3.png b/contrib/plotting-visualization/images/plotly-bar-labels-3.png new file mode 100644 index 00000000..967b0655 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-bar-labels-3.png differ
diff --git a/contrib/plotting-visualization/images/plotly-bar-title.png b/contrib/plotting-visualization/images/plotly-bar-title.png new file mode 100644 index
00000000..6e622abe Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-bar-title.png differ diff --git a/contrib/plotting-visualization/images/plotly-basic-bar-plot.png b/contrib/plotting-visualization/images/plotly-basic-bar-plot.png new file mode 100644 index 00000000..7e1f300b Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-basic-bar-plot.png differ diff --git a/contrib/plotting-visualization/images/plotly-basic-line-chart.png b/contrib/plotting-visualization/images/plotly-basic-line-chart.png new file mode 100644 index 00000000..fa5955f0 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-basic-line-chart.png differ diff --git a/contrib/plotting-visualization/images/plotly-basic-pie-chart.png b/contrib/plotting-visualization/images/plotly-basic-pie-chart.png new file mode 100644 index 00000000..bb827f9a Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-basic-pie-chart.png differ diff --git a/contrib/plotting-visualization/images/plotly-basic-scatter-plot.png b/contrib/plotting-visualization/images/plotly-basic-scatter-plot.png new file mode 100644 index 00000000..64b6234e Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-basic-scatter-plot.png differ diff --git a/contrib/plotting-visualization/images/plotly-horizontal-bar-plot.png b/contrib/plotting-visualization/images/plotly-horizontal-bar-plot.png new file mode 100644 index 00000000..dde43a13 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-horizontal-bar-plot.png differ diff --git a/contrib/plotting-visualization/images/plotly-line-color.png b/contrib/plotting-visualization/images/plotly-line-color.png new file mode 100644 index 00000000..e8dbc1c4 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-line-color.png differ diff --git a/contrib/plotting-visualization/images/plotly-line-dashed.png b/contrib/plotting-visualization/images/plotly-line-dashed.png new file mode 100644 index 00000000..b7e18e2b Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-line-dashed.png differ diff --git a/contrib/plotting-visualization/images/plotly-line-dasheddotted.png b/contrib/plotting-visualization/images/plotly-line-dasheddotted.png new file mode 100644 index 00000000..c3e31fb2 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-line-dasheddotted.png differ diff --git a/contrib/plotting-visualization/images/plotly-line-datapoint-label.png b/contrib/plotting-visualization/images/plotly-line-datapoint-label.png new file mode 100644 index 00000000..6480d654 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-line-datapoint-label.png differ diff --git a/contrib/plotting-visualization/images/plotly-line-dotted.png b/contrib/plotting-visualization/images/plotly-line-dotted.png new file mode 100644 index 00000000..5f92ad4d Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-line-dotted.png differ diff --git a/contrib/plotting-visualization/images/plotly-line-markers.png b/contrib/plotting-visualization/images/plotly-line-markers.png new file mode 100644 index 00000000..1197268f Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-line-markers.png differ diff --git a/contrib/plotting-visualization/images/plotly-line-multiple-lines.png b/contrib/plotting-visualization/images/plotly-line-multiple-lines.png new file mode 100644 index 00000000..68a4139e Binary files /dev/null and 
b/contrib/plotting-visualization/images/plotly-line-multiple-lines.png differ diff --git a/contrib/plotting-visualization/images/plotly-line-title.png b/contrib/plotting-visualization/images/plotly-line-title.png new file mode 100644 index 00000000..1d7ce85f Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-line-title.png differ diff --git a/contrib/plotting-visualization/images/plotly-line-width.png b/contrib/plotting-visualization/images/plotly-line-width.png new file mode 100644 index 00000000..7cbe21e3 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-line-width.png differ diff --git a/contrib/plotting-visualization/images/plotly-long-format-bar-plot.png b/contrib/plotting-visualization/images/plotly-long-format-bar-plot.png new file mode 100644 index 00000000..5bb67784 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-long-format-bar-plot.png differ diff --git a/contrib/plotting-visualization/images/plotly-pie-color-1.png b/contrib/plotting-visualization/images/plotly-pie-color-1.png new file mode 100644 index 00000000..9ff0ab91 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-pie-color-1.png differ diff --git a/contrib/plotting-visualization/images/plotly-pie-color-2.png b/contrib/plotting-visualization/images/plotly-pie-color-2.png new file mode 100644 index 00000000..d46fea98 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-pie-color-2.png differ diff --git a/contrib/plotting-visualization/images/plotly-pie-labels.png b/contrib/plotting-visualization/images/plotly-pie-labels.png new file mode 100644 index 00000000..3a246591 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-pie-labels.png differ diff --git a/contrib/plotting-visualization/images/plotly-pie-patterns.png b/contrib/plotting-visualization/images/plotly-pie-patterns.png new file mode 100644 index 00000000..a07bb3d0 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-pie-patterns.png differ diff --git a/contrib/plotting-visualization/images/plotly-pie-pull.png b/contrib/plotting-visualization/images/plotly-pie-pull.png new file mode 100644 index 00000000..202314b8 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-pie-pull.png differ diff --git a/contrib/plotting-visualization/images/plotly-pie-title.png b/contrib/plotting-visualization/images/plotly-pie-title.png new file mode 100644 index 00000000..e3d3ae7e Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-pie-title.png differ diff --git a/contrib/plotting-visualization/images/plotly-rounded-bars.png b/contrib/plotting-visualization/images/plotly-rounded-bars.png new file mode 100644 index 00000000..fa3b83b8 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-rounded-bars.png differ diff --git a/contrib/plotting-visualization/images/plotly-scatter-colour-2.png b/contrib/plotting-visualization/images/plotly-scatter-colour-2.png new file mode 100644 index 00000000..c6b3f14f Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-scatter-colour-2.png differ diff --git a/contrib/plotting-visualization/images/plotly-scatter-colour.png b/contrib/plotting-visualization/images/plotly-scatter-colour.png new file mode 100644 index 00000000..ef4819b2 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-scatter-colour.png differ diff --git a/contrib/plotting-visualization/images/plotly-scatter-hover.png 
b/contrib/plotting-visualization/images/plotly-scatter-hover.png new file mode 100644 index 00000000..20889573 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-scatter-hover.png differ diff --git a/contrib/plotting-visualization/images/plotly-scatter-size.png b/contrib/plotting-visualization/images/plotly-scatter-size.png new file mode 100644 index 00000000..3f8b78c2 Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-scatter-size.png differ diff --git a/contrib/plotting-visualization/images/plotly-scatter-title.png b/contrib/plotting-visualization/images/plotly-scatter-title.png new file mode 100644 index 00000000..39f85d0d Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-scatter-title.png differ diff --git a/contrib/plotting-visualization/images/plotly-wide-format-bar-plot.png b/contrib/plotting-visualization/images/plotly-wide-format-bar-plot.png new file mode 100644 index 00000000..ff3523ca Binary files /dev/null and b/contrib/plotting-visualization/images/plotly-wide-format-bar-plot.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image1.png b/contrib/plotting-visualization/images/seaborn-plotting/image1.png new file mode 100644 index 00000000..a8a6017e Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image1.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image10.png b/contrib/plotting-visualization/images/seaborn-plotting/image10.png new file mode 100644 index 00000000..e6df1bdd Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image10.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image11.png b/contrib/plotting-visualization/images/seaborn-plotting/image11.png new file mode 100644 index 00000000..e485ff71 Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image11.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image12.png b/contrib/plotting-visualization/images/seaborn-plotting/image12.png new file mode 100644 index 00000000..ae2a54dc Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image12.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image13.png b/contrib/plotting-visualization/images/seaborn-plotting/image13.png new file mode 100644 index 00000000..0f3b05cd Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image13.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image14.png b/contrib/plotting-visualization/images/seaborn-plotting/image14.png new file mode 100644 index 00000000..4bcf460e Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image14.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image15.png b/contrib/plotting-visualization/images/seaborn-plotting/image15.png new file mode 100644 index 00000000..de6603cf Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image15.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image16.png b/contrib/plotting-visualization/images/seaborn-plotting/image16.png new file mode 100644 index 00000000..ceb0df69 Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image16.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image2.png 
b/contrib/plotting-visualization/images/seaborn-plotting/image2.png new file mode 100644 index 00000000..a63d89e2 Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image2.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image3.png b/contrib/plotting-visualization/images/seaborn-plotting/image3.png new file mode 100644 index 00000000..2336257b Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image3.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image4.png b/contrib/plotting-visualization/images/seaborn-plotting/image4.png new file mode 100644 index 00000000..897634b4 Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image4.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image5.png b/contrib/plotting-visualization/images/seaborn-plotting/image5.png new file mode 100644 index 00000000..5b7c14f8 Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image5.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image6.png b/contrib/plotting-visualization/images/seaborn-plotting/image6.png new file mode 100644 index 00000000..ea1bbced Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image6.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image7.png b/contrib/plotting-visualization/images/seaborn-plotting/image7.png new file mode 100644 index 00000000..ff1de854 Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image7.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image8.png b/contrib/plotting-visualization/images/seaborn-plotting/image8.png new file mode 100644 index 00000000..1343cc65 Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image8.png differ diff --git a/contrib/plotting-visualization/images/seaborn-plotting/image9.png b/contrib/plotting-visualization/images/seaborn-plotting/image9.png new file mode 100644 index 00000000..a18193e0 Binary files /dev/null and b/contrib/plotting-visualization/images/seaborn-plotting/image9.png differ diff --git a/contrib/plotting-visualization/images/split-violin-plot.png b/contrib/plotting-visualization/images/split-violin-plot.png new file mode 100644 index 00000000..170d287a Binary files /dev/null and b/contrib/plotting-visualization/images/split-violin-plot.png differ diff --git a/contrib/plotting-visualization/images/stacked_violin_plots.png b/contrib/plotting-visualization/images/stacked_violin_plots.png new file mode 100644 index 00000000..a580a2c9 Binary files /dev/null and b/contrib/plotting-visualization/images/stacked_violin_plots.png differ diff --git a/contrib/plotting-visualization/images/violen-plots1.webp b/contrib/plotting-visualization/images/violen-plots1.webp new file mode 100644 index 00000000..9e842df5 Binary files /dev/null and b/contrib/plotting-visualization/images/violen-plots1.webp differ diff --git a/contrib/plotting-visualization/images/violenplotnormal.png b/contrib/plotting-visualization/images/violenplotnormal.png new file mode 100644 index 00000000..63d7c2d6 Binary files /dev/null and b/contrib/plotting-visualization/images/violenplotnormal.png differ diff --git a/contrib/plotting-visualization/images/violin-hatching.png b/contrib/plotting-visualization/images/violin-hatching.png new file mode 100644 index 
00000000..ceab19b6 Binary files /dev/null and b/contrib/plotting-visualization/images/violin-hatching.png differ diff --git a/contrib/plotting-visualization/images/violin-labelling.png b/contrib/plotting-visualization/images/violin-labelling.png new file mode 100644 index 00000000..0bc7f813 Binary files /dev/null and b/contrib/plotting-visualization/images/violin-labelling.png differ diff --git a/contrib/plotting-visualization/images/violin-plot4.png b/contrib/plotting-visualization/images/violin-plot4.png new file mode 100644 index 00000000..12fb04b3 Binary files /dev/null and b/contrib/plotting-visualization/images/violin-plot4.png differ diff --git a/contrib/plotting-visualization/images/violinplotnocolor.png b/contrib/plotting-visualization/images/violinplotnocolor.png new file mode 100644 index 00000000..960dbc31 Binary files /dev/null and b/contrib/plotting-visualization/images/violinplotnocolor.png differ

diff --git a/contrib/plotting-visualization/index.md b/contrib/plotting-visualization/index.md index 7e43d9b2..479ee883 100644 --- a/contrib/plotting-visualization/index.md +++ b/contrib/plotting-visualization/index.md @@ -6,5 +6,12 @@
 - [Pie Charts in Matplotlib](matplotlib-pie-charts.md)
 - [Line Charts in Matplotlib](matplotlib-line-plots.md)
 - [Scatter Plots in Matplotlib](matplotlib-scatter-plot.md)
+- [Violin Plots in Matplotlib](matplotlib-violin-plots.md)
+- [Subplots in Matplotlib](matplotlib-sub-plot.md)
 - [Introduction to Seaborn and Installation](seaborn-intro.md)
+- [Seaborn Plotting Functions](seaborn-plotting.md)
 - [Getting started with Seaborn](seaborn-basics.md)
+- [Bar Plots in Plotly](plotly-bar-plots.md)
+- [Pie Charts in Plotly](plotly-pie-charts.md)
+- [Line Charts in Plotly](plotly-line-charts.md)
+- [Scatter Plots in Plotly](plotly-scatter-plots.md)

diff --git a/contrib/plotting-visualization/matplotlib-sub-plot.md b/contrib/plotting-visualization/matplotlib-sub-plot.md new file mode 100644 index 00000000..16c294cc --- /dev/null +++ b/contrib/plotting-visualization/matplotlib-sub-plot.md @@ -0,0 +1,130 @@

### 1. Using `plt.subplots()`

The `plt.subplots()` function is a versatile and easy way to create a grid of subplots. It returns a figure and an array of Axes objects.

#### Code Explanation

1. **Import Libraries**:
   ```python
   import matplotlib.pyplot as plt
   import numpy as np
   ```

2. **Generate Sample Data**:
   ```python
   x = np.linspace(0, 10, 100)
   y1 = np.sin(x)
   y2 = np.cos(x)
   y3 = np.tan(x)
   ```

3. **Create Subplots**:
   ```python
   fig, axs = plt.subplots(3, 1, figsize=(8, 12))
   ```

   - `3, 1` indicates a 3-row, 1-column grid.
   - `figsize` specifies the overall size of the figure.

4. **Plot Data**:
   ```python
   axs[0].plot(x, y1, 'r')
   axs[0].set_title('Sine Function')

   axs[1].plot(x, y2, 'g')
   axs[1].set_title('Cosine Function')

   axs[2].plot(x, y3, 'b')
   axs[2].set_title('Tangent Function')
   ```

5. **Adjust Layout and Show Plot**:
   ```python
   plt.tight_layout()
   plt.show()
   ```

#### Result

The result will be a figure with three vertically stacked subplots.

![subplot Chart](images/subplots.png)
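One option worth knowing is axis sharing: `plt.subplots()` can link the axes of the whole grid via `sharex`/`sharey`. A minimal sketch (not pictured above):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# sharex=True links the x-axis of all subplots in the grid
fig, axs = plt.subplots(3, 1, figsize=(8, 12), sharex=True)

axs[0].plot(x, np.sin(x), 'r')
axs[1].plot(x, np.cos(x), 'g')
axs[2].plot(x, np.tan(x), 'b')

plt.tight_layout()
plt.show()
```

With `sharex=True`, setting limits or zooming on one subplot applies to all of them, and the redundant x tick labels on the upper plots are dropped automatically.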
### 2. Using `plt.subplot()`

The `plt.subplot()` function allows you to add a single subplot at a time to a figure.

#### Code Explanation

1. **Import Libraries and Generate Data** (same as above).

2. **Create Figure and Subplots**:
   ```python
   plt.figure(figsize=(8, 12))

   plt.subplot(3, 1, 1)
   plt.plot(x, y1, 'r')
   plt.title('Sine Function')

   plt.subplot(3, 1, 2)
   plt.plot(x, y2, 'g')
   plt.title('Cosine Function')

   plt.subplot(3, 1, 3)
   plt.plot(x, y3, 'b')
   plt.title('Tangent Function')
   ```

3. **Adjust Layout and Show Plot** (same as above).

#### Result

The result will be similar to the first method but created using individual subplot commands.

![subplot Chart](images/subplots.png)

### 3. Using `GridSpec`

`GridSpec` allows for more complex subplot layouts.

#### Code Explanation

1. **Import Libraries and Generate Data** (same as above).

2. **Create Figure and GridSpec**:
   ```python
   from matplotlib.gridspec import GridSpec

   fig = plt.figure(figsize=(8, 12))
   gs = GridSpec(3, 1, figure=fig)
   ```

3. **Create Subplots**:
   ```python
   ax1 = fig.add_subplot(gs[0, 0])
   ax1.plot(x, y1, 'r')
   ax1.set_title('Sine Function')

   ax2 = fig.add_subplot(gs[1, 0])
   ax2.plot(x, y2, 'g')
   ax2.set_title('Cosine Function')

   ax3 = fig.add_subplot(gs[2, 0])
   ax3.plot(x, y3, 'b')
   ax3.set_title('Tangent Function')
   ```

4. **Adjust Layout and Show Plot** (same as above).

#### Result

The result will again be three subplots in a vertical stack, created using the flexible `GridSpec`.

![subplot Chart](images/subplots.png)
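A simple stack like this does not show what makes `GridSpec` special: its cells can be sliced so that one subplot spans several rows or columns, which a plain `plt.subplots()` grid cannot do. A minimal sketch (there is no reference image for this variant):

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.gridspec import GridSpec

x = np.linspace(0, 10, 100)

fig = plt.figure(figsize=(8, 6))
gs = GridSpec(2, 2, figure=fig)

# One wide subplot spanning both columns of the top row
ax_top = fig.add_subplot(gs[0, :])
ax_top.plot(x, np.sin(x), 'r')
ax_top.set_title('Sine Function (spans two columns)')

# Two side-by-side subplots in the bottom row
ax_left = fig.add_subplot(gs[1, 0])
ax_left.plot(x, np.cos(x), 'g')
ax_left.set_title('Cosine Function')

ax_right = fig.add_subplot(gs[1, 1])
ax_right.plot(x, np.tan(x), 'b')
ax_right.set_title('Tangent Function')

plt.tight_layout()
plt.show()
```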
### Summary

- **`plt.subplots()`**: Creates an entire grid of subplots in one call (axes can optionally be shared).
- **`plt.subplot()`**: Adds individual subplots to a figure one at a time.
- **`GridSpec`**: Allows for complex and custom subplot layouts.

By mastering these techniques, you can create detailed and organized visualizations, enhancing the clarity and comprehension of your data presentations.

diff --git a/contrib/plotting-visualization/matplotlib-violin-plots.md b/contrib/plotting-visualization/matplotlib-violin-plots.md new file mode 100644 index 00000000..ef2ec42c --- /dev/null +++ b/contrib/plotting-visualization/matplotlib-violin-plots.md @@ -0,0 +1,277 @@

# Violin Plots in Matplotlib

A violin plot is a method of plotting numeric data together with its probability density. It is a combination of a box plot and a kernel density plot, providing a richer visualization of the distribution of the data. In a violin plot, each dataset is represented by a kernel density estimate, mirrored and joined together to form a symmetrical shape resembling a violin, hence the name.

Violin plots are particularly useful when comparing distributions across different categories or groups. They provide insights into the shape, spread, and central tendency of the data, allowing for a more comprehensive understanding than traditional box plots.

Compared to box plots, violin plots offer a more detailed representation of the distribution: they combine summary statistics with a kernel density estimate, handle unequal sample sizes gracefully, make comparison across groups easy, and reveal multiple modes that a box plot would hide.

![Violin plot example](images/violen-plots1.webp)

## Prerequisites

Before creating violin plots in Matplotlib, you must ensure that you have Python as well as Matplotlib installed on your system.

## Creating a simple Violin Plot with the `violinplot()` method

A basic violin plot can be created with the `violinplot()` method in `matplotlib.pyplot`.

```Python
import matplotlib.pyplot as plt
import numpy as np

# Creating dataset
data = [np.random.normal(0, std, 100) for std in range(1, 5)]

# Creating Plot
plt.violinplot(data)

# Show plot
plt.show()
```

When executed, this would show the following violin plot:

![Basic violin plot](images/violinplotnocolor.png)

The `violinplot()` function in `matplotlib.pyplot` creates a violin plot, which is a graphical representation of the distribution of data across different levels of a categorical variable. Here's a breakdown of its usage:

```Python
plt.violinplot(data, showmeans=False, showextrema=False)
```

- `data`: This parameter represents the dataset used to create the violin plot. It can be a single array or a sequence of arrays.

- `showmeans`: This optional parameter, if set to `True`, displays the mean value as a line on the violin plot. Default is `False`.

- `showextrema`: This optional parameter, if set to `True`, displays the minimum and maximum values as lines on the violin plot. Default is `True`.

Additional parameters can be used to further customize the appearance of the violin plot, such as setting custom colors, adding labels, and adjusting the orientation. For instance:

```Python
plt.violinplot(data, showmedians=True, showmeans=True, showextrema=True, vert=False, widths=0.9, bw_method=0.5)
```

- `showmedians`: Setting this parameter to `True` displays the median value as a line on the violin plot. Default is `False`.

- `vert`: This parameter determines the orientation of the violin plot. Setting it to `False` creates a horizontal violin plot. Default is `True`.

- `widths`: This parameter sets the width of the violins. Default is `0.5`.

- `bw_method`: This parameter determines the method used to calculate the kernel bandwidth for the kernel density estimation. It accepts `'scott'`, `'silverman'`, a scalar, or a callable; the default is `'scott'`.

Using these parameters, you can customize the violin plot according to your requirements, enhancing its readability and visual appeal.

## Customizing Violin Plots in Matplotlib

When customizing violin plots in Matplotlib, using `matplotlib.pyplot.subplots()` provides greater flexibility for applying customizations.

### Coloring Violin Plots

`violinplot()` has no color parameter of its own; instead, you can assign custom colors by setting the face color of each violin body it returns.

```Python
import matplotlib.pyplot as plt
import numpy as np

# Creating dataset
data = [np.random.normal(0, std, 100) for std in range(1, 5)]
colors = ['tab:red', 'tab:blue', 'tab:green', 'tab:orange']

# Creating plot using matplotlib.pyplot.subplots()
fig, ax = plt.subplots()

# Customizing colors of violins
for i in range(len(data)):
    parts = ax.violinplot(data[i], positions=[i], vert=False, showmeans=False, showextrema=False, showmedians=True, widths=0.9, bw_method=0.5)
    for pc in parts['bodies']:
        pc.set_facecolor(colors[i])

# Show plot
plt.show()
```

This code snippet creates a violin plot with a custom color assigned to each violin, enhancing the visual appeal and clarity of the plot.

![Coloring violin](images/violenplotnormal.png)

When customizing violin plots using `matplotlib.pyplot.subplots()`, you obtain a `Figure` object `fig` and an `Axes` object `ax`, allowing for extensive customization. Each violin plot consists of various components, including the violin body, lines representing the median and extrema, and potential markers for the mean. You can customize these components using the appropriate methods and attributes of the Axes object.
Here's an example of how to customize violin plots:

```Python
import matplotlib.pyplot as plt
import numpy as np

# Creating dataset
data = [np.random.normal(0, std, 100) for std in range(1, 5)]
colors = ['tab:red', 'tab:blue', 'tab:green', 'tab:orange']

# Creating plot using matplotlib.pyplot.subplots()
fig, ax = plt.subplots()

# Creating violin plots
parts = ax.violinplot(data, showmeans=False, showextrema=False, showmedians=True, widths=0.9, bw_method=0.5)

# Customizing colors of violins
for i, pc in enumerate(parts['bodies']):
    pc.set_facecolor(colors[i])

# Customizing median lines (returned as the LineCollection parts['cmedians'])
parts['cmedians'].set_color('black')
parts['cmedians'].set_linewidth(2)

# Adding mean markers (computed from the data, since showmeans is False)
for i, d in enumerate(data):
    ax.scatter(i + 1, np.mean(d), marker='o', color='black', zorder=3)

# Customizing axes labels
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')

# Adding title
ax.set_title('Customized Violin Plot')

# Show plot
plt.show()
```

![Customizing violin](images/violin-plot4.png)

In this example, we customize various components of the violin plot, such as colors, line styles, and markers, to enhance its visual appeal and clarity. Additionally, we modify the axes labels and add a title to provide context to the plot.

### Adding Hatching to Violin Plots

You can add hatching patterns to the violins to enhance their visual distinction. This is done by calling `set_hatch()` on each violin body returned by `violinplot()`.

```Python
import matplotlib.pyplot as plt
import numpy as np

# Creating dataset
data = [np.random.normal(0, std, 100) for std in range(1, 5)]
colors = ['tab:red', 'tab:blue', 'tab:green', 'tab:orange']
hatches = ['/', '\\', '|', '-']

# Creating plot using matplotlib.pyplot.subplots()
fig, ax = plt.subplots()

# Creating violin plots with hatching
parts = ax.violinplot(data, showmeans=False, showextrema=False, showmedians=True, widths=0.9, bw_method=0.5)

for i, pc in enumerate(parts['bodies']):
    pc.set_facecolor(colors[i])
    pc.set_hatch(hatches[i])

# Show plot
plt.show()
```

![violin_hatching](images/violin-hatching.png)

### Labeling Violin Plots

You can add labels to violin plots to provide additional information about the data. This can be achieved by calling `set_label()` on each violin body.

An example is shown here:

```Python
import matplotlib.pyplot as plt
import numpy as np

# Creating dataset
data = [np.random.normal(0, std, 100) for std in range(1, 5)]
labels = ['Group {}'.format(i) for i in range(1, 5)]

# Creating plot using matplotlib.pyplot.subplots()
fig, ax = plt.subplots()

# Creating violin plots
parts = ax.violinplot(data, showmeans=False, showextrema=False, showmedians=True, widths=0.9, bw_method=0.5)

# Adding labels to violin plots
for i, label in enumerate(labels):
    parts['bodies'][i].set_label(label)

# Show plot
plt.legend()
plt.show()
```

![violin_labeling](images/violin-labelling.png)

In this example, each violin plot is labeled according to its group, providing context to the viewer. These customizations can be combined and further refined to create violin plots that effectively convey the underlying data distributions.
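The parameter list earlier did not cover quartiles: if you want actual quartile lines rather than restyled medians, `violinplot()` also accepts a `quantiles` argument, one list of quantile positions per violin. A minimal sketch:

```Python
import matplotlib.pyplot as plt
import numpy as np

# Creating dataset
data = [np.random.normal(0, std, 100) for std in range(1, 5)]

fig, ax = plt.subplots()

# Draw the 25th and 75th percentile lines on every violin
ax.violinplot(data, showmedians=True, quantiles=[[0.25, 0.75]] * len(data))

plt.show()
```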
### Stacked Violin Plots

Drawing several violins on one shared axis is useful when you want to compare the distribution of a single variable across different categories or groups. Passing a list of datasets to `violinplot()` places one violin per group side by side, allowing for easy visual comparison.

```Python
import matplotlib.pyplot as plt
import numpy as np

# Generating sample data
np.random.seed(0)
data1 = np.random.normal(0, 1, 100)
data2 = np.random.normal(2, 1, 100)
data3 = np.random.normal(1, 1, 100)

# Creating a violin plot with one violin per group
plt.violinplot([data1, data2, data3], showmedians=True)

# Adding labels to x-axis ticks
plt.xticks([1, 2, 3], ['Group 1', 'Group 2', 'Group 3'])

# Adding title and labels
plt.title('Stacked Violin Plot')
plt.xlabel('Groups')
plt.ylabel('Values')

# Displaying the plot
plt.show()
```
![stacked violin plots](images/stacked_violin_plots.png)

### Split Violin Plots

Split violin plots are effective for comparing the distribution of a single variable across two different categories or groups, such as the two genders below. Note that plain Matplotlib draws the two violins next to each other; a true split violin, where each half of one violin shows one group, requires extra work or a higher-level library.

```Python
import matplotlib.pyplot as plt
import numpy as np

# Generating sample data
np.random.seed(0)
data_male = np.random.normal(0, 1, 100)
data_female = np.random.normal(2, 1, 100)

# Creating a violin plot comparing the two groups
plt.violinplot([data_male, data_female], showmedians=True)

# Adding labels to x-axis ticks
plt.xticks([1, 2], ['Male', 'Female'])

# Adding title and labels
plt.title('Split Violin Plot')
plt.xlabel('Gender')
plt.ylabel('Values')

# Displaying the plot
plt.show()
```

![split violin plot](images/split-violin-plot.png)

In both examples, we use Matplotlib's `violinplot()` function to create the violin plots. These variations provide additional flexibility and insights when analyzing data distributions across different groups or categories.
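If you want each half of a single violin to show one group (a genuinely split violin), the seaborn library builds this on top of Matplotlib. A minimal sketch, assuming seaborn is installed and using made-up long-form data:

```Python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Long-form data: one value column plus group and subgroup labels
np.random.seed(0)
df = pd.DataFrame({
    "value": np.concatenate([np.random.normal(0, 1, 100),
                             np.random.normal(2, 1, 100)]),
    "gender": ["Male"] * 100 + ["Female"] * 100,
    "group": (["A"] * 50 + ["B"] * 50) * 2,
})

# With a two-level hue and split=True, each half-violin shows one subgroup
sns.violinplot(data=df, x="group", y="value", hue="gender", split=True)
plt.show()
```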
diff --git a/contrib/plotting-visualization/plotly-bar-plots.md b/contrib/plotting-visualization/plotly-bar-plots.md new file mode 100644 index 00000000..5f2159a8 --- /dev/null +++ b/contrib/plotting-visualization/plotly-bar-plots.md @@ -0,0 +1,348 @@

# Bar Plots in Plotly

A bar plot or a bar chart is a type of data visualisation that represents data in the form of rectangular bars, with lengths or heights proportional to the values they represent. Bar plots can be drawn both vertically and horizontally.

It is one of the most widely used types of data visualisation, as it is easy to interpret and pleasing to the eye.

Plotly is a very powerful library for creating modern visualizations and it provides a very easy and intuitive method to create highly customized bar plots.

## Prerequisites

Before creating bar plots in Plotly you must ensure that you have Python, Plotly and Pandas installed on your system.

## Introduction

There are various ways to create bar plots in `plotly`. One of the most prominent and easiest is using `plotly.express`. Plotly Express is the easy-to-use, high-level interface to Plotly, which operates on a variety of types of data and produces easy-to-style figures. Alternatively, you can use `plotly.graph_objects` to create various plots.

Here, we'll be using `plotly.express` to create the bar plots. We'll also be converting our datasets into pandas DataFrames, which makes it extremely convenient to create plots.

Also, note that when you execute the code in a plain Python file, the output plot will be shown in your **browser**, rather than in a pop-up window as in Matplotlib. If you do not want that, it is **recommended to create the plots in a notebook (like Jupyter)**. For this, install an additional library, `nbformat`. This way you can see the output in the notebook itself, and can also render it to png, jpg, etc.

## Creating a simple bar plot using `plotly.express.bar`

With `plotly.express.bar`, each row of the DataFrame is represented as a rectangular mark.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating bar plot
fig = px.bar(df, x='Years', y='Number of Cars sold')

# Showing plot
fig.show()
```
![Basic Bar Plot](images/plotly-basic-bar-plot.png)

Here, we first create the dataset and convert it into a pandas DataFrame using a dictionary, with its keys being the DataFrame columns. Next, we plot the bar chart using `px.bar`. In the `x` and `y` parameters, we have to specify column names in the DataFrame.

**Note:** When you run the above code, it will show you an **interactive plot**; if you want an image, you can download it from the plot itself.

## Customizing Bar Plots

### Adding a title to the graph

Let us create an imaginary graph of the number of cars sold in various years. Simply pass the title of your graph as a parameter in `px.bar`.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating bar plot
fig = px.bar(df, x='Years', y='Number of Cars sold',
             title='Number of cars sold in various years')

# Showing plot
fig.show()
```
![Bar Plot Title](images/plotly-bar-title.png)

### Adding bar colors and legends

To give different bars different colors, simply pass the column name of the x-axis, or a custom column which groups the bars, to the `color` parameter.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating bar plot
fig = px.bar(df, x='Years', y='Number of Cars sold',
             title='Number of cars sold in various years',
             color='Years')

# Showing plot
fig.show()
```
![Bar Colors Basic](images/plotly-bar-colors-basic.png)

Now, let us consider our previous example of the number of cars sold in various years and suppose that we want to give different colors to the bars from different centuries, with respective legends for better interpretation.

The easiest way to achieve this is to add a new column to the DataFrame and then pass it to the `color` parameter.
```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]
# Creating the relevant colors dataset
colors = ['1900s','1900s','2000s','2000s','2000s']

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold, "Century":colors}
df = pd.DataFrame(dataset)

# Creating bar plot
fig = px.bar(df, x='Years', y='Number of Cars sold',
             title='Number of cars sold in various years',
             color='Century')

# Showing plot
fig.show()
```
![Bar Colors](images/plotly-bar-colors.png)

### Adding labels to bars

We may want to add labels to bars representing their absolute (or truncated) values for instant and accurate reading. This can be achieved by setting the `text_auto` parameter to `True`. If you want custom text, you can instead pass a column name to the `text` parameter.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]
colors = ['1900s','1900s','2000s','2000s','2000s']

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold, "Century":colors}
df = pd.DataFrame(dataset)

# Creating bar plot
fig = px.bar(df, x='Years', y='Number of Cars sold',
             title='Number of cars sold in various years',
             color='Century',
             text_auto=True)

# Showing plot
fig.show()
```
![Bar Labels-1](images/plotly-bar-labels-1.png)

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]
colors = ['1900s','1900s','2000s','2000s','2000s']

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold, "Century":colors}
df = pd.DataFrame(dataset)

# Creating bar plot
fig = px.bar(df, x='Years', y='Number of Cars sold',
             title='Number of cars sold in various years',
             color='Century',
             text='Century')

# Showing plot
fig.show()
```
![Bar Labels-2](images/plotly-bar-labels-2.png)

You can also change the text properties (or any other element of your plot) using `fig.update_traces`.

Here, we change the position of the text to place it outside the bars.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]
colors = ['1900s','1900s','2000s','2000s','2000s']

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold, "Century":colors}
df = pd.DataFrame(dataset)

# Creating bar plot
fig = px.bar(df, x='Years', y='Number of Cars sold',
             title='Number of cars sold in various years',
             color='Century',
             text_auto=True)

# Updating bar text properties
fig.update_traces(textposition="outside", cliponaxis=False)

# Showing plot
fig.show()
```
![Bar Labels-3](images/plotly-bar-labels-3.png)

### Rounded Bars

You can create rounded bars by specifying a radius value for `barcornerradius` in `fig.update_layout`.
```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]
colors = ['1900s','1900s','2000s','2000s','2000s']

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold, "Century":colors}
df = pd.DataFrame(dataset)

# Creating bar plot
fig = px.bar(df, x='Years', y='Number of Cars sold',
             title='Number of cars sold in various years',
             color='Century',
             text_auto=True)

# Updating bar text properties
fig.update_traces(textposition="outside", cliponaxis=False)

# Updating figure layout
fig.update_layout({
    'plot_bgcolor': 'rgba(255, 255, 255, 1)',
    'paper_bgcolor': 'rgba(255, 255, 255, 1)',
    'barcornerradius': 15
})

# Showing plot
fig.show()
```

![Rounded Bars](images/plotly-rounded-bars.png)

## Horizontal Bar Plot

To create a horizontal bar plot, you just have to interchange your `x` and `y` DataFrame columns. Plotly takes care of the rest!

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]
colors = ['1900s','1900s','2000s','2000s','2000s']

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold, "Century":colors}
df = pd.DataFrame(dataset)

# Creating bar plot
fig = px.bar(df, x='Number of Cars sold', y='Years',
             title='Number of cars sold in various years',
             color='Century',
             text_auto=True)

# Updating bar text properties
fig.update_traces(textposition="outside", cliponaxis=False)

# Updating figure layout
fig.update_layout({
    'barcornerradius': 30
})

# Showing plot
fig.show()
```
![Horizontal Bar Plot](images/plotly-horizontal-bar-plot.png)

## Plotting Long Format and Wide Format Data

Long-form data has one row per observation and one column per variable. This is suitable for storing and displaying multivariate data, i.e. data with more than two dimensions.

```Python
# Plotting long format data

import plotly.express as px

# Long format dataset
long_df = px.data.medals_long()

# Creating Bar Plot
fig = px.bar(long_df, x="nation", y="count", color="medal", title="Long-Form Input")

# Showing Plot
fig.show()
```
![Long format bar plot](images/plotly-long-format-bar-plot.png)

```Python
print(long_df)

# Output
         nation   medal  count
0   South Korea    gold     24
1         China    gold     10
2        Canada    gold      9
3   South Korea  silver     13
4         China  silver     15
5        Canada  silver     12
6   South Korea  bronze     11
7         China  bronze      8
8        Canada  bronze     12
```

Wide-form data has one row per value of the first variable and one column per value of the second variable. This is suitable for storing and displaying 2-dimensional data.
```Python
# Plotting wide format data
import plotly.express as px

# Wide format dataset
wide_df = px.data.medals_wide()

# Creating Bar Plot
fig = px.bar(wide_df, x="nation", y=["gold", "silver", "bronze"], title="Wide-Form Input")

# Showing Plot
fig.show()
```
![Wide format bar plot](images/plotly-wide-format-bar-plot.png)

```Python
print(wide_df)

# Output
        nation  gold  silver  bronze
0  South Korea    24      13      11
1        China    10      15       8
2       Canada     9      12      12
```
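With a wide-form input like this, `px.bar` stacks the three medal counts on top of each other by default. As a quick sketch (there is no reference image for this variant), you can switch to side-by-side bars with the `barmode` layout option:

```Python
import plotly.express as px

wide_df = px.data.medals_wide()

fig = px.bar(wide_df, x="nation", y=["gold", "silver", "bronze"],
             title="Wide-Form Input, Grouped")

# Bars are stacked by default; 'group' places them side by side
fig.update_layout(barmode='group')

fig.show()
```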
diff --git a/contrib/plotting-visualization/plotly-line-charts.md b/contrib/plotting-visualization/plotly-line-charts.md new file mode 100644 index 00000000..35a2bea3 --- /dev/null +++ b/contrib/plotting-visualization/plotly-line-charts.md @@ -0,0 +1,300 @@

# Line Charts in Plotly

A line chart displays information as a series of data points connected by straight line segments. It represents the change in one quantity with respect to another and helps us to see trends and patterns over time or across categories. It is a basic type of chart common in many fields. For example, it is used to represent the price of stocks with respect to time, among many other things.

It is one of the most widely used types of data visualisation, as it is easy to interpret and pleasing to the eye.

Plotly is a very powerful library for creating modern visualizations and it provides a very easy and intuitive method to create highly customized line charts.

## Prerequisites

Before creating line charts in Plotly you must ensure that you have Python, Plotly and Pandas installed on your system.

## Introduction

There are various ways to create line charts in `plotly`. One of the most prominent and easiest is using `plotly.express`. Plotly Express is the easy-to-use, high-level interface to Plotly, which operates on a variety of types of data and produces easy-to-style figures. Alternatively, you can use `plotly.graph_objects` to create various plots.

Here, we'll be using `plotly.express` to create the line charts. We'll also be converting our datasets into pandas DataFrames, which makes it extremely convenient to create plots.

Also, note that when you execute the code in a plain Python file, the output plot will be shown in your **browser**, rather than in a pop-up window as in Matplotlib. If you do not want that, it is **recommended to create the plots in a notebook (like Jupyter)**. For this, install an additional library, `nbformat`. This way you can see the output in the notebook itself, and can also render it to png, jpg, etc.

## Creating a simple line chart using `plotly.express.line`

With `plotly.express.line`, each data point is represented as a vertex (whose location is given by the `x` and `y` columns) of a polyline mark in 2D space.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating line chart
fig = px.line(df, x='Years', y='Number of Cars sold')

# Showing plot
fig.show()
```

![Basic Line Chart](images/plotly-basic-line-chart.png)

Here, we first create the dataset and convert it into a pandas DataFrame using a dictionary, with its keys being the DataFrame columns. Next, we plot the line chart using `px.line`. In the `x` and `y` parameters, we have to specify column names in the DataFrame.

**Note:** When you run the above code, it will show you an **interactive plot**; if you want an image, you can download it from the plot itself.

## Customizing Line Charts

### Adding a title to the chart

Simply pass the title of your graph as a parameter in `px.line`.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating line chart
fig = px.line(df, x='Years', y='Number of Cars sold',
              title='Number of cars sold in various years')

# Showing plot
fig.show()
```

![Line Chart Title](images/plotly-line-title.png)

### Adding Markers to the lines

The `markers` argument can be set to `True` to show markers on the lines.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating line chart
fig = px.line(df, x='Years', y='Number of Cars sold',
              title='Number of cars sold in various years',
              markers=True)

# Showing plot
fig.show()
```

![Line Markers](images/plotly-line-markers.png)

### Dashed Lines

You can plot dashed lines by changing the `dash` property of `line` to `dash` or `longdash` and passing it as a dictionary to the `patch` parameter in `fig.update_traces`.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating line chart
fig = px.line(df, x='Years', y='Number of Cars sold',
              title='Number of cars sold in various years')

fig.update_traces(patch={"line": {"dash": 'dash'}})

# Showing plot
fig.show()
```

![Dashed Line](images/plotly-line-dashed.png)

### Dotted Lines

You can plot dotted lines by changing the `dash` property of `line` to `dot` and passing it as a dictionary to the `patch` parameter in `fig.update_traces`.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating line chart
fig = px.line(df, x='Years', y='Number of Cars sold',
              title='Number of cars sold in various years')

fig.update_traces(patch={"line": {"dash": 'dot'}})

# Showing plot
fig.show()
```

![Dotted Line](images/plotly-line-dotted.png)

### Dashed and Dotted Lines

You can combine the two styles by changing the `dash` property of `line` to `dashdot` and passing it as a dictionary to the `patch` parameter in `fig.update_traces`.
```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating line chart
fig = px.line(df, x='Years', y='Number of Cars sold',
              title='Number of cars sold in various years')

fig.update_traces(patch={"line": {"dash": 'dashdot'}})

# Showing plot
fig.show()
```

![Dash-Dot Line](images/plotly-line-dasheddotted.png)

### Changing line colors

You can set a custom line color by changing the `color` property of `line` to the desired color and passing it as a dictionary to the `patch` parameter in `fig.update_traces`.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating line chart
fig = px.line(df, x='Years', y='Number of Cars sold',
              title='Number of cars sold in various years')

fig.update_traces(patch={"line": {"color": 'red'}})

# Showing plot
fig.show()
```

![Colored Line](images/plotly-line-color.png)

### Changing line width

You can set a custom line width by changing the `width` property of `line` to the desired width and passing it as a dictionary to the `patch` parameter in `fig.update_traces`.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating line chart
fig = px.line(df, x='Years', y='Number of Cars sold',
              title='Number of cars sold in various years')

fig.update_traces(patch={"line": {"width": 7}})

# Showing plot
fig.show()
```

![Width Line](images/plotly-line-width.png)

### Labeling Data Points

You can label your data points by passing the relevant column name of your DataFrame to the `text` parameter in `px.line`.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years":years, "Number of Cars sold":num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating line chart
fig = px.line(df, x='Years', y='Number of Cars sold',
              title='Number of cars sold in various years',
              text='Number of Cars sold')

fig.update_traces(textposition="bottom right")

# Showing plot
fig.show()
```

![Data Point Labelling](images/plotly-line-datapoint-label.png)

## Plotting multiple lines

There are several ways to plot multiple lines in Plotly, like using `plotly.graph_objects`, using `fig.add_scatter`, having multiple columns in the DataFrame, etc.

Here, we'll create a simple dataset of the runs scored by the end of each over by India and South Africa in a recent T20 World Cup final and plot it using Plotly.
```Python
import plotly.express as px
import pandas as pd

# Creating dataset
overs = list(range(0,21))
runs_india = [0,15,23,26,32,39,45,49,59,68,75,82,93,98,108,118,126,134,150,167,176]
runs_rsa = [0,6,11,14,22,32,42,49,62,71,81,93,101,109,123,147,151,155,157,161,169]

# Converting dataset to pandas DataFrame
dataset = {"overs":overs, "India":runs_india, "South Africa":runs_rsa}
df = pd.DataFrame(dataset)

# Creating line chart
fig = px.line(df, x="overs", y=["India", "South Africa"])
fig.update_layout(xaxis_title="Overs", yaxis_title="Runs", legend_title=None)

# Showing plot
fig.show()
```

![Multiple Lines](images/plotly-line-multiple-lines.png)

To plot multiple lines, we have passed multiple columns of the DataFrame in the `y` parameter.
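The same chart can also be built incrementally with `fig.add_scatter`, one of the alternatives mentioned above. A minimal sketch:

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
overs = list(range(0,21))
runs_india = [0,15,23,26,32,39,45,49,59,68,75,82,93,98,108,118,126,134,150,167,176]
runs_rsa = [0,6,11,14,22,32,42,49,62,71,81,93,101,109,123,147,151,155,157,161,169]

df = pd.DataFrame({"overs":overs, "India":runs_india, "South Africa":runs_rsa})

# Start with one line, then add the second as an extra trace
fig = px.line(df, x="overs", y="India")
fig.add_scatter(x=df["overs"], y=df["South Africa"], mode="lines", name="South Africa")

# Showing plot
fig.show()
```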
diff --git a/contrib/plotting-visualization/plotly-pie-charts.md b/contrib/plotting-visualization/plotly-pie-charts.md new file mode 100644 index 00000000..2f788096 --- /dev/null +++ b/contrib/plotting-visualization/plotly-pie-charts.md @@ -0,0 +1,221 @@

# Pie Charts in Plotly

A pie chart is a type of graph that represents data in a circular form. The slices of the pie show the relative size of the data, making it a pictorial representation. A pie chart requires a list of categorical variables and the corresponding numerical values. Here, the term "pie" represents the whole, and the "slices" represent the parts of the whole.

Pie charts are commonly used in business presentations like sales, operations, survey results, resources, etc., as they are pleasing to the eye and provide a quick summary.

Plotly is a very powerful library for creating modern visualizations and it provides a very easy and intuitive method to create highly customized pie charts.

## Prerequisites

Before creating pie charts in Plotly you must ensure that you have Python, Plotly and Pandas installed on your system.

## Introduction

There are various ways to create pie charts in `plotly`. One of the most prominent and easiest is using `plotly.express`. Plotly Express is the easy-to-use, high-level interface to Plotly, which operates on a variety of types of data and produces easy-to-style figures. Alternatively, you can use `plotly.graph_objects` to create various plots.

Here, we'll be using `plotly.express` to create the pie charts. We'll also be converting our datasets into pandas DataFrames, which makes it extremely convenient to create charts.

Also, note that when you execute the code in a plain Python file, the output plot will be shown in your **browser**, rather than in a pop-up window as in Matplotlib. If you do not want that, it is **recommended to create the plots in a notebook (like Jupyter)**. For this, install an additional library, `nbformat`. This way you can see the output in the notebook itself, and can also render it to png, jpg, etc.

## Creating a simple pie chart using `plotly.express.pie`

In `plotly.express.pie`, the data visualized by the sectors of the pie is set in `values`. The sector labels are set in `names`.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
flowers = ['Rose','Tulip','Marigold','Sunflower','Daffodil']
petals = [11,9,17,4,7]

# Converting dataset to pandas DataFrame
dataset = {'flowers':flowers, 'petals':petals}
df = pd.DataFrame(dataset)

# Creating pie chart
fig = px.pie(df, values='petals', names='flowers')

# Showing plot
fig.show()
```
![Basic Pie Chart](images/plotly-basic-pie-chart.png)

Here, we first create the dataset and convert it into a pandas DataFrame using a dictionary, with its keys being the DataFrame columns. Next, we plot the pie chart using `px.pie`. In the `values` and `names` parameters, we have to specify column names in the DataFrame.

`px.pie(df, values='petals', names='flowers')` specifies that the pie chart takes its values from the column `petals`, with the fractional area of each slice given by **petal/sum(petals)**. The column `flowers` provides the labels of the slices corresponding to each value in `petals`.

**Note:** When you run the above code, it will show you an **interactive plot**; if you want an image, you can download it from the plot itself.

## Customizing Pie Charts

### Adding a title to the chart

Simply pass the title of your chart as a parameter in `px.pie`.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
flowers = ['Rose','Tulip','Marigold','Sunflower','Daffodil']
petals = [11,9,17,4,7]

# Converting dataset to pandas DataFrame
dataset = {'flowers':flowers, 'petals':petals}
df = pd.DataFrame(dataset)

# Creating pie chart
fig = px.pie(df, values='petals', names='flowers',
             title='Number of Petals in Flowers')

# Showing plot
fig.show()
```
![Title in Pie Chart](images/plotly-pie-title.png)

### Coloring Slices

There are a lot of beautiful color scales available in Plotly; they can be found here: [plotly color scales](https://plotly.com/python/builtin-colorscales/).
Choose your favourite colorscale and apply it like this:

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
flowers = ['Rose','Tulip','Marigold','Sunflower','Daffodil']
petals = [11,9,17,4,7]

# Converting dataset to pandas DataFrame
dataset = {'flowers':flowers, 'petals':petals}
df = pd.DataFrame(dataset)

# Creating pie chart
fig = px.pie(df, values='petals', names='flowers',
             title='Number of Petals in Flowers',
             color_discrete_sequence=px.colors.sequential.Agsunset)

# Showing plot
fig.show()
```
![Pie Chart Colors-1](images/plotly-pie-color-1.png)

You can also set a custom color for each label by passing a dictionary (map) to `color_discrete_map`, like this:

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
flowers = ['Rose','Tulip','Marigold','Sunflower','Daffodil']
petals = [11,9,17,4,7]

# Converting dataset to pandas DataFrame
dataset = {'flowers':flowers, 'petals':petals}
df = pd.DataFrame(dataset)

# Creating pie chart
fig = px.pie(df, values='petals', names='flowers',
             title='Number of Petals in Flowers',
             color='flowers',
             color_discrete_map={'Rose':'red',
                                 'Tulip':'magenta',
                                 'Marigold':'green',
                                 'Sunflower':'yellow',
                                 'Daffodil':'royalblue'})

# Showing plot
fig.show()
```
![Pie Chart Colors-2](images/plotly-pie-color-2.png)

### Labeling Slices

You can use `fig.update_traces` to control the properties of the text displayed on your figure. For example, if we want the flower name, petal count, and percentage in our slices, we can do it like this:

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
flowers = ['Rose','Tulip','Marigold','Sunflower','Daffodil']
petals = [11,9,17,4,7]

# Converting dataset to pandas DataFrame
dataset = {'flowers':flowers, 'petals':petals}
df = pd.DataFrame(dataset)

# Creating pie chart
fig = px.pie(df, values='petals', names='flowers',
             title='Number of Petals in Flowers')

# Updating text properties
fig.update_traces(textposition='inside', textinfo='label+value+percent')

# Showing plot
fig.show()
```
![Pie Labels](images/plotly-pie-labels.png)

### Pulling out a slice

To pull out a slice, pass an array to the `pull` parameter in `fig.update_traces`, with one entry per slice giving the amount by which to pull it out.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
flowers = ['Rose','Tulip','Marigold','Sunflower','Daffodil']
petals = [11,9,17,4,7]

# Converting dataset to pandas DataFrame
dataset = {'flowers':flowers, 'petals':petals}
df = pd.DataFrame(dataset)

# Creating pie chart
fig = px.pie(df, values='petals', names='flowers',
             title='Number of Petals in Flowers')

# Updating text properties
fig.update_traces(textposition='inside', textinfo='label+value')

# Pulling out slice
fig.update_traces(pull=[0,0,0,0.2,0])

# Showing plot
fig.show()
```
![Slice Pull](images/plotly-pie-pull.png)
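Another common variation is the donut chart. As a minimal sketch (there is no reference image for this one), `px.pie` accepts a `hole` parameter that cuts out the centre of the pie:

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
flowers = ['Rose','Tulip','Marigold','Sunflower','Daffodil']
petals = [11,9,17,4,7]

df = pd.DataFrame({'flowers':flowers, 'petals':petals})

# hole is the fraction of the radius cut out of the middle
fig = px.pie(df, values='petals', names='flowers',
             title='Number of Petals in Flowers', hole=0.4)

# Showing plot
fig.show()
```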
### Pattern Fills

You can also add patterns (hatches), in addition to colors, to pie charts.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
flowers = ['Rose','Tulip','Marigold','Sunflower','Daffodil']
petals = [11,9,17,4,7]

# Converting dataset to pandas DataFrame
dataset = {'flowers':flowers, 'petals':petals}
df = pd.DataFrame(dataset)

# Creating pie chart
fig = px.pie(df, values='petals', names='flowers',
             title='Number of Petals in Flowers')

# Updating text properties
fig.update_traces(textposition='outside', textinfo='label+value')

# Adding pattern fills
fig.update_traces(marker=dict(pattern=dict(shape=[".", "/", "+", "-","+"])))

# Showing plot
fig.show()
```
![Pattern Fills Pie Chart](images/plotly-pie-patterns.png)

diff --git a/contrib/plotting-visualization/plotly-scatter-plots.md b/contrib/plotting-visualization/plotly-scatter-plots.md new file mode 100644 index 00000000..dc40b1f7 --- /dev/null +++ b/contrib/plotting-visualization/plotly-scatter-plots.md @@ -0,0 +1,198 @@

# Scatter Plots in Plotly

* A scatter plot is a type of data visualization that uses dots to show values for two variables, with one variable on the x-axis and the other on the y-axis. It's useful for identifying relationships, trends, and correlations, as well as spotting clusters and outliers.
* The pattern of the dots shows how the variables are related. In Plotly, a scatter plot is made with `px.scatter()`.

## Prerequisites

Before creating scatter plots in Plotly you must ensure that you have Python, Plotly and Pandas installed on your system.

## Introduction

There are various ways to create scatter plots in `plotly`. One of the most prominent and easiest is using `plotly.express`. Plotly Express is the easy-to-use, high-level interface to Plotly, which operates on a variety of types of data and produces easy-to-style figures. Alternatively, you can use `plotly.graph_objects` to create various plots.

Here, we'll be using `plotly.express` to create the scatter plots. We'll also be converting our datasets into pandas DataFrames, which makes it extremely convenient to create charts.

Also, note that when you execute the code in a plain Python file, the output plot will be shown in your **browser**, rather than in a pop-up window as in Matplotlib. If you do not want that, it is **recommended to create the plots in a notebook (like Jupyter)**. For this, install an additional library, `nbformat`. This way you can see the output in the notebook itself, and can also render it to png, jpg, etc.

## Creating a simple Scatter Plot using `plotly.express.scatter`

In `plotly.express.scatter`, each data point is represented as a marker point, whose location is given by the `x` and `y` columns.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years": years, "Number of Cars sold": num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating scatter plot
fig = px.scatter(df, x='Years', y='Number of Cars sold')

# Showing plot
fig.show()
```
![Basic Scatter Plot](images/plotly-basic-scatter-plot.png)

Here, we first create the dataset and convert it into a pandas DataFrame using a dictionary, with its keys being the DataFrame columns. Next, we plot the scatter plot using `px.scatter`. In the `x` and `y` parameters, we have to specify column names in the DataFrame.
`px.scatter(df, x='Years', y='Number of Cars sold')` specifies that the scatter plot takes the values from the column `Years` for the x-axis and the values from the column `Number of Cars sold` for the y-axis.

Note: When you run the above code, it will show you an interactive plot. If you want an image, you can download it from the interactive plot itself.

## Customizing Scatter Plots

### Adding a title to the plot

Simply pass the title of your plot as a parameter in `px.scatter`.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years": years, "Number of Cars sold": num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating scatter plot
fig = px.scatter(df, x='Years', y='Number of Cars sold', title='Number of cars sold in various years')

# Showing plot
fig.show()
```
![Scatter Plot title](images/plotly-scatter-title.png)

### Adding marker colors and legends

* To give different markers different colors, simply pass the column name of the x-axis, or a custom column which groups the markers, to the `color` parameter.
* There are a lot of beautiful color scales available in Plotly; they can be found here: [plotly color scales](https://plotly.com/python/builtin-colorscales/). Choose your favourite colorscale and apply it like this:

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years": years, "Number of Cars sold": num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating scatter plot
fig = px.scatter(df, x='Years', y='Number of Cars sold',
                 title='Number of cars sold in various years',
                 color='Years',
                 color_discrete_sequence=px.colors.sequential.Agsunset)

# Showing plot
fig.show()
```
![Scatter Plot Colors-1](images/plotly-scatter-colour.png)

You can also set a custom color for each label by passing a dictionary (map) to `color_discrete_map`, like this:

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years": years, "Number of Cars sold": num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating scatter plot
fig = px.scatter(df, x='Years',
                 y='Number of Cars sold',
                 title='Number of cars sold in various years',
                 color='Years',
                 color_discrete_map={'1998':'red',
                                     '1999':'magenta',
                                     '2000':'green',
                                     '2001':'yellow',
                                     '2002':'royalblue'})

# Showing plot
fig.show()
```
![Scatter Plot Colors-2](images/plotly-scatter-colour-2.png)

### Setting Marker Sizes

We may want to vary the size of the markers, for example to make magnitude differences between categories visible. This can be done by using the `size` parameter in `px.scatter`, where we specify a column in the DataFrame that determines the size of each marker.
### Setting the Size of Points

We may want points of different sizes to make differences between categories more visible. This can be done with the `size` parameter in `px.scatter`, where we specify a DataFrame column that determines the size of each point.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years": years, "Number of Cars sold": num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating scatter plot
fig = px.scatter(df, x='Years',
                 y='Number of Cars sold',
                 title='Number of cars sold in various years',
                 color='Years',
                 color_discrete_map={'1998':'red',
                                     '1999':'magenta',
                                     '2000':'green',
                                     '2001':'yellow',
                                     '2002':'royalblue'},
                 size='Number of Cars sold')

# Showing plot
fig.show()
```
![Scatter plot size](images/plotly-scatter-size.png)

### Giving a hover effect

You can use the `hover_name` and `hover_data` parameters in `px.scatter`. The `hover_name` parameter specifies the column to use for the hover text, and the `hover_data` parameter lets you specify additional data to display when hovering over a point.

```Python
import plotly.express as px
import pandas as pd

# Creating dataset
years = ['1998', '1999', '2000', '2001', '2002']
num_of_cars_sold = [200, 300, 500, 700, 1000]

# Converting dataset to pandas DataFrame
dataset = {"Years": years, "Number of Cars sold": num_of_cars_sold}
df = pd.DataFrame(dataset)

# Creating scatter plot
fig = px.scatter(df, x='Years',
                 y='Number of Cars sold',
                 title='Number of cars sold in various years',
                 color='Years',
                 color_discrete_map={'1998':'red',
                                     '1999':'magenta',
                                     '2000':'green',
                                     '2001':'yellow',
                                     '2002':'royalblue'},
                 size='Number of Cars sold',
                 hover_name='Years',
                 hover_data={'Number of Cars sold': True})

# Showing plot
fig.show()
```
![Scatter Hover](images/plotly-scatter-hover.png)

diff --git a/contrib/plotting-visualization/seaborn-plotting.md b/contrib/plotting-visualization/seaborn-plotting.md
new file mode 100644
index 00000000..655f625e
--- /dev/null
+++ b/contrib/plotting-visualization/seaborn-plotting.md
@@ -0,0 +1,259 @@
# Seaborn

Seaborn is a powerful and easy-to-use data visualization library in Python built on top of Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Below we cover the main plotting functions offered by Seaborn, along with examples to illustrate their usage.
Seaborn simplifies the process of creating complex visualizations with a few lines of code, and it integrates closely with pandas data structures, making it an excellent choice for data analysis and exploration.

## Setting up Seaborn

Make sure the Seaborn library is installed on your system. If not, install it with
`pip install seaborn`

After installing, you are all set to experiment with the plotting functions.

```python
# import necessary libraries

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
```

Seaborn includes several built-in datasets that you can use for practice. You can list all available datasets with the command below.
```python
sns.get_dataset_names()
```

Here we are using the 'tips' dataset.

```python
# loading an example dataset
tips = sns.load_dataset('tips')
```

Before delving into plotting, make yourself comfortable with the dataset. To do that, use the pandas library to understand what information the dataset contains and to preprocess the data, as in the sketch below. If you get stuck, feel free to refer to the pandas documentation.
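For example, a quick first look at the dataset can be taken with the standard pandas inspection methods; a minimal sketch:

```python
import seaborn as sns

# Load the example dataset
tips = sns.load_dataset('tips')

print(tips.head())       # first five rows
tips.info()              # column names, dtypes and non-null counts (prints directly)
print(tips.describe())   # summary statistics for the numeric columns
```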
## Relational Plots

Relational plots are used to visualize the relationship between two or more variables.

### Scatter Plot
A scatter plot displays data points based on two numerical variables. Seaborn's `scatterplot` function allows you to create scatter plots with ease.

```python
# scatter plot using Seaborn

plt.figure(figsize=(5,5))
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='day', style='time')
plt.title('Scatter Plot of Total Bill vs Tip')
plt.show()
```
![scatter plot](images/seaborn-plotting/image1.png)

### Line Plot
A line plot connects data points in the order they appear in the dataset, which is useful for time series data. The `lineplot` function allows you to create line plots.

```python
# lineplot using Seaborn

plt.figure(figsize=(5,5))
sns.lineplot(data=tips, x='size', y='total_bill', hue='day')
plt.title('Line Plot of Total Bill by Size and Day')
plt.show()
```

![lineplot](images/seaborn-plotting/image2.png)

## Distribution Plots

Distribution plots visualize the distribution of a single numerical variable.

### Histplot
A histplot (histogram) displays the distribution of a numerical variable by dividing the data into bins.

```python
# Histplot using Seaborn

plt.figure(figsize=(5,5))
sns.histplot(data=tips, x='total_bill', kde=True)
plt.title('Histplot of Total Bill')
plt.show()
```
![Histplot](images/seaborn-plotting/image3.png)

### KDE Plot
A Kernel Density Estimate (KDE) plot represents the distribution of a variable as a smooth curve.

```python
# KDE Plot using Seaborn

plt.figure(figsize=(5,5))
sns.kdeplot(data=tips, x='total_bill', hue='sex', fill=True)
plt.title('KDE Plot of Total Bill by Sex')
plt.show()
```
![KDE](images/seaborn-plotting/image4.png)

### ECDF Plot
An Empirical Cumulative Distribution Function (ECDF) plot shows the proportion of data points below each value.

```python
# ECDF Plot using Seaborn

plt.figure(figsize=(5,5))
sns.ecdfplot(data=tips, x='total_bill', hue='sex')
plt.title('ECDF Plot of Total Bill by Sex')
plt.show()
```
![ECDF](images/seaborn-plotting/image5.png)

### Rug Plot
A rug plot shows the distribution of a variable by drawing a small vertical line (a "rug") at each data point along the x-axis.

```python
# Rug Plot using Seaborn

plt.figure(figsize=(3,3))
sns.rugplot(x='total_bill', data=tips)
plt.title('Rug Plot of Total Bill Amounts')
plt.show()
```
![Rug](images/seaborn-plotting/image6.png)

## Categorical Plots
Categorical plots are used to visualize data where one or more variables are categorical.

### Bar Plot

A bar plot shows the relationship between a categorical variable and a numerical variable.
```python
# Bar Plot using Seaborn

plt.figure(figsize=(5,5))
sns.barplot(data=tips, x='day', y='total_bill', hue='sex')
plt.title('Bar Plot of Total Bill by Day and Sex')
plt.show()
```
![Bar](images/seaborn-plotting/image7.png)

### Point Plot
A point plot shows a point estimate (by default, the mean) of a numerical variable for each level of a categorical variable, with a vertical line marking the confidence interval, and connects the estimates across categories. A sketch of customizing the estimate follows the example below.

```python
# Point Plot using Seaborn

plt.figure(figsize=(5,5))
sns.pointplot(x='day', y='total_bill', hue='sex', data=tips)
plt.title('Average Total Bill by Day and Sex')
plt.show()
```
![Point](images/seaborn-plotting/image8.png)
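Both the estimate and the error bars can be changed through the `estimator` and `errorbar` parameters; the sketch below plots the median with a standard-deviation band (an assumption here: `errorbar` requires seaborn 0.12 or newer, where it replaced the older `ci` argument):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

plt.figure(figsize=(5,5))
# Median estimate per category with a standard-deviation error bar
sns.pointplot(x='day', y='total_bill', hue='sex', data=tips,
              estimator=np.median, errorbar='sd')
plt.title('Median Total Bill by Day and Sex')
plt.show()
```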
### Box Plot
A box plot displays the distribution of a numerical variable across different categories.

```python
# Box Plot using Seaborn

plt.figure(figsize=(5,5))
sns.boxplot(data=tips, x='day', y='total_bill', hue='sex')
plt.title('Box Plot of Total Bill by Day and Sex')
plt.show()
```
![Box](images/seaborn-plotting/image9.png)

### Violin Plot
A violin plot combines aspects of a box plot and a KDE plot to show the distribution of the data.

```python
# Violin Plot using Seaborn

plt.figure(figsize=(5,5))
sns.violinplot(data=tips, x='day', y='total_bill', hue='sex', split=True)
plt.title('Violin Plot of Total Bill by Day and Sex')
plt.show()
```
![Violin](images/seaborn-plotting/image10.png)

## Matrix Plots
Matrix plots are useful for visualizing data in a matrix format.

### Heatmap
A heatmap displays data in a matrix where values are represented by color.

```python
# Heatmap using Seaborn

plt.figure(figsize=(10,8))
flights = sns.load_dataset('flights')
flights_pivot = flights.pivot(index='month', columns='year', values='passengers')
sns.heatmap(flights_pivot, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Heatmap of Flight Passengers')
plt.show()
```
![Heatmap](images/seaborn-plotting/image11.png)

## Pair Plot
A pair plot shows the pairwise relationships between multiple variables in a dataset.

```python
# Pairplot using Seaborn (pairplot creates its own figure,
# so no plt.figure call is needed)

g = sns.pairplot(tips, hue='sex')
g.figure.suptitle('Pair Plot of Tips Dataset', y=1.02)
plt.show()
```
![Pair](images/seaborn-plotting/image12.png)

## FacetGrid
FacetGrid allows you to create a grid of plots based on the values of one or more categorical variables.

```python
# FacetGrid using Seaborn (FacetGrid also creates its own figure)

g = sns.FacetGrid(tips, col='sex', row='time', margin_titles=True)
g.map(sns.scatterplot, 'total_bill', 'tip')
plt.show()
```
![Facetgrid](images/seaborn-plotting/image13.png)

## Customizing Seaborn Plots
Seaborn plots can be customized to improve their appearance and convey more information.

### Changing the Aesthetic Style
Seaborn comes with several built-in themes.

```python
sns.set_style('whitegrid')
sns.scatterplot(data=tips, x='total_bill', y='tip')
plt.title('Scatter Plot with Whitegrid Style')
plt.show()
```
![Aesthetic](images/seaborn-plotting/image14.png)

### Customizing Colors
You can use color palettes to customize the colors in your plots.

```python
sns.set_palette('pastel')
sns.barplot(data=tips, x='day', y='total_bill', hue='sex')
plt.title('Bar Plot with Pastel Palette')
plt.show()
```
![Colors](images/seaborn-plotting/image15.png)

### Adding Titles and Labels
Titles and labels can be added to make plots more informative.

```python
plot = sns.scatterplot(data=tips, x='total_bill', y='tip')
plot.set_title('Scatter Plot of Total Bill vs Tip')
plot.set_xlabel('Total Bill ($)')
plot.set_ylabel('Tip ($)')
plt.show()
```
![Titles](images/seaborn-plotting/image16.png)
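Since every Seaborn plot is drawn on a Matplotlib figure, any of the plots above can also be saved to disk with Matplotlib's standard saving call; a minimal sketch:

```python
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')

sns.scatterplot(data=tips, x='total_bill', y='tip')
plt.title('Scatter Plot of Total Bill vs Tip')

# Save the current figure; dpi and bbox_inches are optional quality tweaks
plt.savefig('total-bill-vs-tip.png', dpi=300, bbox_inches='tight')
```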
Seaborn is a versatile library that simplifies the creation of complex visualizations. By using Seaborn's plotting functions, you can create a wide range of statistical graphics with minimal effort. Whether you're working with relational data, categorical data, or distributions, Seaborn provides the tools you need to visualize your data effectively.
diff --git a/contrib/scipy/index.md b/contrib/scipy/index.md
index 57371141..5425c4ac 100644
--- a/contrib/scipy/index.md
+++ b/contrib/scipy/index.md
@@ -1,4 +1,5 @@
 # List of sections

 - [Installation of Scipy and its key uses](installation_features.md)
+- [SciPy Graphs](scipy-graphs.md)
diff --git a/contrib/scipy/scipy-graphs.md b/contrib/scipy/scipy-graphs.md
new file mode 100644
index 00000000..60bb5659
--- /dev/null
+++ b/contrib/scipy/scipy-graphs.md
@@ -0,0 +1,165 @@
# SciPy Graphs
Graphs are a type of data structure, and SciPy provides the `scipy.sparse.csgraph` module for working with them.

## Adjacency Matrix
An adjacency matrix is a way of representing a graph using a square matrix. The element at the i-th row and j-th column indicates whether there is an edge from vertex i to vertex j.

```python
import numpy as np
from scipy.sparse import csr_matrix

adj_matrix = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0]
])

sparse_matrix = csr_matrix(adj_matrix)

print(sparse_matrix)
```

In this example:

1. The graph has 4 nodes.
2. There is an edge between node 0 and node 1, node 1 and node 2, and node 2 and node 3.
3. The `csr_matrix` function converts the dense adjacency matrix into compressed sparse row (CSR) format, which is efficient for storing large, sparse matrices.

## Floyd Warshall

The Floyd-Warshall algorithm is a classic algorithm for finding the shortest paths between all pairs of nodes in a weighted graph.

```python
import numpy as np
from scipy.sparse.csgraph import floyd_warshall
from scipy.sparse import csr_matrix

arr = np.array([
    [0, 1, 2],
    [1, 0, 0],
    [2, 0, 0]
])

newarr = csr_matrix(arr)

print(floyd_warshall(newarr, return_predecessors=True))
```

#### Output

```
(array([[0., 1., 2.],
       [1., 0., 3.],
       [2., 3., 0.]]), array([[-9999,     0,     0],
       [    1, -9999,     0],
       [    2,     0, -9999]], dtype=int32))
```

## Dijkstra

Dijkstra's algorithm is used to find the shortest path from a source node to all other nodes in a graph with non-negative edge weights.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.sparse import csr_matrix

arr = np.array([
    [0, 1, 2],
    [1, 0, 0],
    [2, 0, 0]
])

newarr = csr_matrix(arr)

print(dijkstra(newarr, return_predecessors=True, indices=0))
```

#### Output

```
(array([ 0.,  1.,  2.]), array([-9999,     0,     0], dtype=int32))
```

## Bellman Ford

The Bellman-Ford algorithm is used to find the shortest path from a single source vertex to all other vertices in a weighted graph. It can handle graphs with negative edge weights, and it also detects negative weight cycles.

```python
import numpy as np
from scipy.sparse.csgraph import bellman_ford
from scipy.sparse import csr_matrix

arr = np.array([
    [0, -1, 2],
    [1, 0, 0],
    [2, 0, 0]
])

newarr = csr_matrix(arr)

print(bellman_ford(newarr, return_predecessors=True, indices=0))
```

#### Output

```
(array([ 0., -1.,  2.]), array([-9999,     0,     0], dtype=int32))
```
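In these outputs, the second array is the predecessor array: for each node it stores the previous node on the shortest path from the source, with `-9999` meaning no predecessor (the source itself, or an unreachable node). The helper below is an illustrative sketch, not part of SciPy, showing how one might walk that array back to recover an actual path:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

arr = np.array([
    [0, 1, 2],
    [1, 0, 0],
    [2, 0, 0]
])

# 1-D predecessor array because indices is a single source node
distances, predecessors = dijkstra(csr_matrix(arr), return_predecessors=True, indices=0)

def reconstruct_path(predecessors, source, target):
    # Walk backwards from target to source via the predecessor array
    path = [target]
    while path[-1] != source:
        prev = predecessors[path[-1]]
        if prev == -9999:  # no predecessor: target is unreachable
            return None
        path.append(int(prev))
    return path[::-1]

print(reconstruct_path(predecessors, 0, 2))  # [0, 2]
```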
## Depth First Order

Depth-First Search (DFS) is an algorithm for traversing or searching tree or graph data structures. It starts at the root and explores as far as possible along each branch before backtracking.

```python
import numpy as np
from scipy.sparse.csgraph import depth_first_order
from scipy.sparse import csr_matrix

arr = np.array([
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [2, 1, 1, 0],
    [0, 1, 0, 1]
])

newarr = csr_matrix(arr)

print(depth_first_order(newarr, 1))
```

#### Output

```
(array([1, 0, 3, 2], dtype=int32), array([    1, -9999,     1,     0], dtype=int32))
```

## Breadth First Order

Breadth-First Search (BFS) is an algorithm for traversing or searching tree or graph data structures. It starts at the root and explores all nodes at the present depth level before moving on to nodes at the next depth level.

```python
import numpy as np
from scipy.sparse.csgraph import breadth_first_order
from scipy.sparse import csr_matrix

arr = np.array([
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [2, 1, 1, 0],
    [0, 1, 0, 1]
])

newarr = csr_matrix(arr)

print(breadth_first_order(newarr, 1))
```

#### Output

```
(array([1, 0, 2, 3], dtype=int32), array([    1, -9999,     1,     1], dtype=int32))
```

In both traversals, the first array is the order in which the nodes are visited starting from node 1, and the second array gives each node's predecessor in the traversal, with `-9999` marking the starting node.