Python
for Big Data
Why study Python?
Python is a powerful, flexible, open-source language that is
easy to learn, easy to use, and has powerful libraries for data
manipulation and analysis. Its simple syntax is very accessible
to programming novices, and will look familiar to anyone with
experience in Matlab, C/C++, Java, or Visual Basic. Python is
unusual in being both a capable general-purpose programming
language and easy to use for analytical and quantitative
computing.
For over a decade, Python has been used in scientific
computing and highly quantitative domains such as finance,
oil and gas, physics, and signal processing. It has been used to
improve Space Shuttle mission design and to process images from
the Hubble Space Telescope, and it was instrumental in orchestrating
the physics experiments that led to the discovery of the Higgs
boson (the so-called “God particle”).
At the same time, Python has been used to build massively
scalable web applications like YouTube, and has powered much
of Google’s internal infrastructure. Companies like Disney, Sony,
Dreamworks, and Lucasfilm ILM rely heavily on Python to
coordinate massive clusters of computer graphics servers to
produce the imagery for blockbuster movies. According to the
TIOBE index, Python is one of the most popular programming
languages in the world, ranking higher than Perl, Ruby, and
JavaScript by a wide margin.
Python can be as powerful and successful for big data and
business data analytics as it has been for science, engineering,
and scalable computing. Python is easy for analysts to learn
and use, but powerful enough to tackle even the most difficult
problems in virtually any domain. It integrates well with existing
IT infrastructure and is largely platform independent. Among
modern languages, Python-based solutions are well known for
their agility and productivity. Companies of all sizes and in
all areas — from the biggest investment banks to the smallest
social/mobile web app startups — are using Python to run their
business and manage their data.
Course Objectives
This course is designed to cover key concepts for writing
Python code, emphasizing the design of functional and efficient
code. It will set students down the road to mastering the
intricacies of Python. After completing the course, students
should be able to read, understand, modify, and create complex
functions to perform a variety of tasks.
Program: PYTHON FOR BIG DATA
Duration: 5 days
Certification Awarded: Award by The Center for Applied Data Science
Enquiries: info@thecads.org
Pre-requisites: Basic Programming skills
Program Outline
Day 1: Introduction to Python (8 hours)
Topics:
• Installing Python
• Python development workflow, Python REPL and running Python files directly
• Introduction to basic data types, control structures and conditions
• Functions, error handling and exceptions
• Command line programs
Assignments:
• Simple programs involving basic Python language constructs
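As a rough illustration of the Day 1 topics (functions, control structures, and exceptions), a first exercise might look something like the sketch below. The function names `mean` and `report` are made up for this example and are not taken from the course material.

```python
def mean(values):
    """Average a list of numbers, raising ValueError for an empty list."""
    if not values:
        raise ValueError("mean() needs at least one value")
    total = 0
    for value in values:  # explicit loop to show a basic control structure
        total += value
    return total / len(values)


def report(values):
    """Return a printable summary, handling the empty case gracefully."""
    try:
        return f"mean = {mean(values):.2f}"
    except ValueError as exc:
        return f"no result: {exc}"
```

Calling `report([1, 2, 3])` returns a formatted mean, while `report([])` returns a short message instead of crashing.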
Day 2: More Python, useful Python libraries, etc. (8 hours)
Topics:
• Python collections such as tuples, lists, dictionaries and sets
• List comprehensions, decorators
• Modules and libraries
• Object-oriented programming in Python
• Useful Python libraries
Assignments:
• Command-line utility for counting unique words in a given set of files
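The Day 2 assignment (counting unique words in a set of files) could be approached roughly as follows. This is only one possible sketch using the standard library; the word pattern, the usage message, and the names `count_words` and `main` are assumptions for illustration.

```python
import re
import sys
from collections import Counter

# Assumption: a "word" is a run of letters, digits, or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def count_words(paths):
    """Return a Counter mapping each lowercased word to its frequency."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:  # line by line, so large files stay cheap
                counts.update(WORD_RE.findall(line.lower()))
    return counts


def main(argv):
    """Command-line entry point: argv[1:] are the files to scan."""
    if len(argv) < 2:
        print("usage: wordcount.py FILE [FILE ...]", file=sys.stderr)
        return 2
    counts = count_words(argv[1:])
    print(f"{len(counts)} unique words")
    for word, n in counts.most_common(10):
        print(f"{n:6d}  {word}")
    return 0
```

Saved as a script, this could be wired up by calling `sys.exit(main(sys.argv))` under an `if __name__ == "__main__":` guard.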
Day 3: Working with different data formats: CSV, JSON, XML (8 hours)
Topics:
• Working with files
• Handling large files
• Python libraries for reading and writing CSV, JSON, XML and other data formats
• Serializing Python data structures
Assignments:
• Processing data files in a given format, normalizing the data and making it available in a different data format, e.g. XML to CSV
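For the XML-to-CSV conversion mentioned in the Day 3 assignment, a minimal sketch with the standard `xml.etree.ElementTree` and `csv` modules might look like this. The sample document, the `record` element, and the `name` and `city` fields are invented for the example; real input files would have their own structure.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input: element and field names are assumptions.
SAMPLE_XML = """\
<records>
  <record><name> Alice </name><city>Kuala Lumpur</city></record>
  <record><name>Bob</name></record>
</records>
"""

FIELDS = ["name", "city"]


def xml_to_rows(xml_text, fields=FIELDS):
    """Yield one dict per <record>, stripping whitespace and filling gaps."""
    root = ET.fromstring(xml_text)
    for record in root.iter("record"):
        # findtext returns None for a missing child; normalize that to "".
        yield {f: (record.findtext(f) or "").strip() for f in fields}


def rows_to_csv(rows, fields=FIELDS):
    """Serialize the row dicts to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Here `rows_to_csv(xml_to_rows(SAMPLE_XML))` produces a header line followed by one normalized line per record, with an empty cell where Bob has no city.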
Day 4: Getting data from APIs and databases (8 hours)
Topics:
• Introduction to ‘requests’ library
• Connecting to and downloading data from a given API
• Reading web pages using Python
• Working with databases using Python
Assignments:
• Accessing Twitter API to explore the social web
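The database topic on Day 4 can be illustrated without any external service by using the standard `sqlite3` module with an in-memory database. This sketch does not cover the ‘requests’ or Twitter API parts of the day; the `people` table and the `top_cities` function are hypothetical names chosen for the example.

```python
import sqlite3


def top_cities(rows, limit=3):
    """Load (name, city) rows into an in-memory database and rank cities."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE people (name TEXT, city TEXT)")
        # Parameterized inserts avoid building SQL strings by hand.
        conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT city, COUNT(*) AS n FROM people "
            "GROUP BY city ORDER BY n DESC, city LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

The function returns a list of `(city, count)` tuples, most frequent city first.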
Day 5: Cleaning messy data, dealing with missing values, scraping the web, regular expressions (8 hours)
Topics:
• Finding and extracting raw data
• Tidy data and how to make data tidy, reshaping and transforming data
• Practical implementation through a range of Python libraries: textual data, dates, etc.
• Fetching web pages, extracting data from them and making the data available in various formats
• Processing data in parallel
Assignments:
• Web scraping
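A small taste of the Day 5 cleaning topics (regular expressions, missing values, and dates) is sketched below using only the standard library. The list of missing-value markers and the accepted date formats are assumptions for illustration; real datasets will need their own rules.

```python
import re
from datetime import datetime

# Assumptions: which strings count as "missing" and which date layouts occur.
MISSING = {"", "na", "n/a", "null", "-"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")


def clean_text(value):
    """Collapse runs of whitespace and map missing-value markers to None."""
    text = re.sub(r"\s+", " ", value).strip()
    return None if text.lower() in MISSING else text


def parse_date(value):
    """Try each known format in turn; return None if nothing matches."""
    text = clean_text(value)
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # not this format, try the next one
    return None
```

Returning `None` for unparseable values keeps missing data explicit, so later steps can decide whether to drop or fill those entries.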
The Center of Applied Data Science Sdn Bhd (1129780-U)
(FORMERLY KNOWN AS IBIG MAXE APPLIED DATA SCIENCE CENTER SDN BHD)
B-19-3, Level 19, Tower B, Northpoint Offices, Mid Valley City
No. 1, Medan Syed Putra Utara
59200 Kuala Lumpur, Malaysia
T +603 2201 0236
www.thecads.org