Data Engineering and Machine Learning Using Python
Module 1: Introduction to Machine Learning
▪ Introduction To Machine Learning
▪ Life Cycle of Machine Learning
▪ Skills required for Machine Learning
▪ Career Paths in Machine Learning
▪ Applications of Machine Learning
Module 3: Python for Machine Learning
▪ Python programming:
▪ Environment Setup
▪ Jupyter Notebook Overview
▪ Data types: Numbers, Strings, Printing, Lists, Dictionaries, Booleans, Tuples, Sets
▪ Comparison Operators
▪ if, elif, else Statements
▪ Loops: for Loops, while Loops
▪ range()
▪ list comprehension
▪ functions
▪ lambda expressions
▪ map and filter
▪ methods
▪ Programming Exercises.
▪ Object Oriented Programming
▪ Modules and packages
▪ Errors and Exception Handling
▪ Python Decorators
▪ Python generators
▪ Collections
▪ Regular Expressions
▪ Python for Exploratory Data Analysis:
▪ NumPy:
▪ Installing NumPy
▪ Using NumPy
▪ NumPy arrays
▪ Creating NumPy arrays from Python lists
▪ Creating arrays using built-in methods: arange(), zeros(), ones(), linspace(), eye(), rand(), etc.
▪ Array attributes: shape, dtype
▪ Array methods: reshape(), min(), max(), argmax(), argmin(), etc.
▪ Pandas:
▪ Introduction to Pandas
▪ Series
▪ DataFrames
▪ Missing Data
▪ GroupBy
▪ Merging, Joining and Concatenating
▪ Operations
▪ Data Input and Output
▪ Python for Data Visualization:
▪ Matplotlib:
▪ Installing Matplotlib, Basic Matplotlib commands
▪ Creating multiple plots on the same canvas
▪ Object-Oriented Method: figure(), plot(), add_axes(), subplots(), etc.
▪ Matplotlib Exercise
▪ Seaborn:
▪ Categorical plot
▪ Distribution plot
▪ Regression plot
▪ Seaborn Exercise
▪ Pandas built-in visualization:
▪ Scatter plot
▪ Histograms
▪ Box plot
▪ CAPSTONE PROJECT FOR DATA ANALYSIS
Module 4: Deep dive into Machine Learning
▪ Introduction To Machine Learning:
▪ Relationship between Data Science and Machine Learning
▪ Supervised Learning
▪ Unsupervised Learning
Supervised Learning (Regression and Classification Algorithms):
▪ Linear Regression
▪ Ridge Regression
▪ Lasso Regression
▪ Polynomial Regression
▪ Support vector regression
▪ Decision Tree Regression
▪ Random Forest Regression
▪ Logistic Regression
▪ Support Vector Machines
▪ Kernel SVM
▪ Decision Trees and Random Forest
▪ Ensemble Of Decision Trees
▪ Model Evaluation and Improvement
Unsupervised Learning:
▪ Challenges in Unsupervised Learning
▪ Preprocessing and Scaling
▪ Dimensionality Reduction, Feature Extraction
▪ Principal Component Analysis (PCA)
▪ Clustering
▪ K-Means
▪ Model evaluation and improvement
▪ Cross validation, Grid search, Evaluation metrics and scoring
▪ Working with text data
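The cross-validation topic above can be illustrated with a minimal sketch of how k-fold splitting partitions sample indices. The function name `k_fold_indices` is hypothetical and written for this outline; it assumes unshuffled, contiguous folds and is not the API of any particular library.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Hypothetical helper: folds are contiguous and unshuffled.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    # Each sample is used for validation exactly once across the folds.
    print(len(train), validation)
```

In practice the data is usually shuffled before splitting, and a model is trained and scored once per fold; the scores are then averaged.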
Module 5: NLP & Recommender Systems:
▪ Corpus
▪ Text preprocessing using the bag-of-words technique
▪ TF (Term Frequency)
▪ IDF (Inverse Document Frequency)
▪ Normalization
▪ Vectorization
▪ NLP with Python
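The TF, IDF, normalization and vectorization topics in this module fit together as sketched below. This is a pure-Python illustration on a made-up three-document corpus, assuming whitespace tokenization, the unsmoothed definition idf = log(N / df), and L2 normalization; libraries commonly use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Made-up corpus for illustration only.
corpus = [
    "spark makes big data simple",
    "hadoop stores big data",
    "python makes data analysis simple",
]


def tokenize(text):
    # Assumption: lowercase and split on whitespace.
    return text.lower().split()


docs = [tokenize(d) for d in corpus]
vocabulary = sorted({t for doc in docs for t in doc})


def term_frequency(tokens):
    # TF: share of the document's tokens taken up by each term.
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}


def inverse_document_frequency(documents):
    # IDF: log(N / df), where df is the number of documents containing the term.
    n = len(documents)
    df = Counter(t for doc in documents for t in set(doc))
    return {t: math.log(n / df[t]) for t in df}


idf = inverse_document_frequency(docs)


def tfidf_vector(tokens):
    # Vectorization: one TF-IDF weight per vocabulary term, then L2 normalization.
    tf = term_frequency(tokens)
    raw = [tf.get(t, 0.0) * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]


vectors = [tfidf_vector(d) for d in docs]
```

Because "data" occurs in every document, its IDF is log(3/3) = 0, so it contributes nothing to any vector; this is the sense in which IDF down-weights uninformative terms.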
Hadoop Developer Course
During this course you will learn:
• Linux (Ubuntu/CentOS) - Tips and Tricks
• Basic Java Programming – Core Java OOP Concepts
• Introduction to Big Data and Hadoop
• Hadoop ecosystem concepts
• Hadoop MapReduce concepts and features
• Developing MapReduce applications
• Pig concepts
• Hive concepts
• Impala
• Oozie workflow concepts
• Sqoop Data Ingestion
• Flume Agents
• Tableau Visualization
• HBase concepts
• Real Time tools like Hue, PuTTY, FileZilla, Cloudera Manager
• Real Time Projects
Linux (Ubuntu/CentOS) - Tips and Tricks
Basic (Core) Java Programming Concepts – OOP
Introduction to Big Data and Hadoop
• What is Big Data?
• What are the challenges for processing big data?
• What is Hadoop?
• Why Hadoop?
• History of Hadoop
• Hadoop ecosystem
• HDFS
• MapReduce
Understanding the Cluster
• Hadoop 2.x Architecture
• Typical workflow
• HDFS Commands
• Writing files to HDFS
• Reading files from HDFS
• Rack awareness
• Hadoop daemons
Let's talk MapReduce
• Before MapReduce
• MapReduce overview
• Word count problem
• Word count flow and solution
• MapReduce flow
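The word count flow listed above can be sketched as a single-process simulation of the map, shuffle-and-sort, and reduce steps. This is plain Python, not the Hadoop API; the function names and the two input lines are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    # Shuffle and sort: group all values by key, then order the keys.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    # Reduce: sum the counts collected for one word.
    return key, sum(values)


# Made-up input; on a cluster each line would come from an input split.
lines = ["big data is big", "hadoop processes big data"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(mapped))
print(word_counts)
```

On a real cluster the map and reduce calls run in separate tasks on different nodes, and the framework performs the grouping step between them.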
Developing the MapReduce Application
• Data Types
• File Formats
• Explain the Driver, Mapper and Reducer code
• Configuring development environment - Eclipse
• Writing unit tests
• Running locally
• Running on cluster
• Hands-on exercises
How MapReduce Works
• Anatomy of MapReduce job run
• Job submission
• Job initialization
• Task assignment
• Job completion
• Job scheduling
• Job failures
• Shuffle and sort
• Hands-on exercises
MapReduce Types and Formats
• File Formats – Sequence Files
• Compression Techniques
• Input Formats - Input splits & records, text input, binary input
• Output Formats - text output, binary output, lazy output
• Hands-on exercises
MapReduce Features
• Counters
• Side data distribution
• MapReduce combiner
• MapReduce partitioner
• MapReduce distributed cache
• Hands-on exercises
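The counter, combiner and partitioner features listed above can be sketched in plain Python as follows. This is a single-process illustration, not Hadoop code: `NUM_REDUCERS`, the counter names, the sample map outputs, and the use of CRC32 as the hash are all assumptions chosen to keep the example deterministic.

```python
import zlib
from collections import Counter

NUM_REDUCERS = 2  # hypothetical number of reduce tasks


def partition(key, num_reducers=NUM_REDUCERS):
    # Partitioner: hash the key so the same key always reaches the same reducer.
    # CRC32 is a stand-in for a stable hash function.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    # Combiner: pre-aggregate (word, count) pairs on the map side
    # to reduce the amount of data sent across the network.
    combined = Counter()
    for key, value in pairs:
        combined[key] += value
    return list(combined.items())


# Made-up output of two map tasks.
map_outputs = [
    [("big", 1), ("data", 1), ("big", 1)],
    [("big", 1), ("hadoop", 1)],
]

counters = Counter()  # job-level counters (names are hypothetical)
reducer_inputs = {r: [] for r in range(NUM_REDUCERS)}

for output in map_outputs:
    counters["map_output_records"] += len(output)
    combined = combine(output)
    counters["combine_output_records"] += len(combined)
    for key, value in combined:
        reducer_inputs[partition(key)].append((key, value))

# Reduce: sum the partial counts each reducer received.
totals = Counter()
for pairs in reducer_inputs.values():
    for key, value in pairs:
        totals[key] += value
```

Comparing the two counters shows the combiner's effect: five records leave the mappers, but only four are shuffled to the reducers, while the final totals are unchanged.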
Hive
• Hive Architecture
• Types of Metastore
• Hive Data Types
• HiveQL
• File Formats – Parquet, ORC, Sequence and Avro Files Comparison
• Partitioning & Bucketing
• Hive JDBC Client
• Hive UDFs
• Hive Serdes
• Hive on Tez
• Hands-on exercises
• Integration with Tableau
Pig
• Pig Architecture
• Pig Data Types
• Load/Store Functions
• Pig Latin
• Pig UDFs
HBase
• HBase architecture and concepts
• HBase Data Model
• HBase Shell Interface
• HBase Java API
Sqoop
• Sqoop Architecture
• Sqoop Import Command Arguments, Incremental Import
• Sqoop Export
• Sqoop Jobs
• Hands-on exercises
Flume
• Flume Architecture
• Flume Agent Setup
• Types of sources, channels, sinks
• Multi-Agent Flow
• Hands-on exercises
Oozie
• Oozie Fundamentals
• Oozie workflow creation
• Oozie Job submission, monitoring, debugging
• Concepts on Coordinators and Bundles
• Hands-on exercises
Case Study Discussions
Any one of the Four Projects
• Log File Analysis covering Flume, HDFS, MR/Pig, Hive, Tableau
• Crime Data Analysis covering Oozie, Sqoop, HDFS, Hive, HBase, RESTful Client
• Hadoop Use Cases in Insurance Domain
• Hadoop Use Cases in Retail Domain
Scala or Python, Spark
➢ Understand the difference between Apache Spark and Hadoop
➢ Learn Scala and its programming implementation
✓ Why Scala or Python
✓ Scala Installation
✓ Get deep insights into the functioning of Scala
✓ Execute Pattern Matching in Scala
✓ Functional Programming in Scala – Closures, Currying, Expressions, Anonymous Functions
✓ Know the concepts of classes in Scala
✓ Object Orientation in Scala – Primary, Auxiliary Constructors, Singleton & Companion Objects
✓ Traits and Abstract classes in Scala
✓ Scala Simple Build Tool – SBT
✓ Building with Maven
➢ Spark Basics
✓ What is Apache Spark?
✓ Spark Installation
✓ Spark Configuration
✓ Spark Context
✓ Using Spark Shell
✓ Resilient Distributed Datasets (RDDs) – Features, Partitions, Tuning Parallelism
✓ Functional Programming with Spark
➢ Working with RDDs
✓ RDD Operations - Transformations and Actions
✓ Types of RDDs
✓ Key-Value Pair RDDs – Transformations and Actions
✓ MapReduce and Pair RDD Operations
✓ Serialization
➢ Spark on a cluster
✓ Overview
✓ A Spark Standalone Cluster
✓ The Spark Standalone Web UI
✓ Executors & Cluster Manager
✓ Spark on YARN Framework
➢ Writing Spark Applications
✓ Spark Applications vs. Spark Shell
✓ Creating the SparkContext
✓ Configuring Spark Properties
✓ Building and Running a Spark Application
✓ Logging
✓ Spark Job Anatomy
➢ Caching and Persistence
✓ RDD Lineage
✓ Caching Overview
✓ Distributed Persistence
➢ Improving Spark Performance
✓ Shared Variables: Broadcast Variables
✓ Shared Variables: Accumulators
✓ Per Partition Processing
✓ Common Performance Issues
➢ Spark API for different File Formats & Compression Codecs
✓ Text
✓ CSV
✓ Sequence
✓ Parquet
✓ ORC
✓ Compression Techniques – Snappy, Zlib, Gzip
➢ Spark SQL
✓ Spark SQL Overview
✓ HiveContext
✓ SQL Datatypes
✓ DataFrames vs RDDs
✓ Operations on DataFrames
✓ Parquet Files with Spark SQL – Read, Write, Partitioning, Merging Schema
✓ ORC Files
✓ JSON Files
✓ Inferring Schema programmatically
✓ Custom Case Classes
✓ Temp Tables vs Persistent Tables
✓ Writing UDFs
✓ Hive Support
✓ JDBC Support - Examples
✓ HBase Support - Examples
➢ Spark Streaming
✓ Spark Streaming Overview
✓ Example: Streaming Word Count
✓ Other Streaming Operations
✓ Sliding Window Operations
✓ Developing Spark Streaming Applications – Integration with Kafka and HBase
Complementary Course: AWS