CM2015 – Programming with Data [SIM – UOL]
Topic 1
Introduction to Data Programming
Learning Outcomes
After completing this topic and the recommended reading, you should be able
to:
• Set up and run Jupyter Notebook on a Windows, Mac or Linux operating
system.
• Use Jupyter Notebook to write and edit code.
• Write and explain simple Python programs using variables and
mathematical operators.
Prepared by Koh Chung Haur @ 2022 (version 2022.1) 1
1. Introduction to Data Programming
Data (definition)
• “Facts and statistics collected together for reference or analysis.”
[Oxford English Dictionary]
• “Information, especially facts or numbers, collected to be examined and
considered and used to help decision-making, or information in an
electronic form that can be stored and used by a computer.”
[Cambridge Dictionary]
• “Factual information (such as measurements or statistics) used as a basis
for reasoning, discussion, or calculation.”
[Merriam-Webster]
• “Data are individual facts, statistics, or items of information, often
numeric, that are collected through observation. In a more technical
sense, data are a set of values of qualitative or quantitative variables
about one or more persons or objects.”
[Wikipedia]
Information (definition)
• “Facts provided or learned about something or someone.”
[Oxford English Dictionary]
• “Facts or details about a situation, person, event, etc.”
[Cambridge Dictionary]
• “Knowledge obtained from investigation, study, or instruction.”
[Merriam-Webster]
• “Knowledge communicated or received concerning a particular fact or
circumstance; knowledge gained through study, communication,
research, instruction, etc.”
[Dictionary.com]
Data vs. Information
• Data
o Raw, unorganised facts that need to be processed.
o Unusable until it is organised.
• Information
o Created when data is processed, organised, and structured.
o Needs to be put in an appropriate context in order to become
useful.
Data Science
Data → Processing → Information
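As a small illustration of how processing turns data into information, the sketch below summarises a list of raw readings. The readings and variable names are invented for this example.

```python
# Raw data: unorganised daily temperature readings (hypothetical values)
readings = [31.2, 29.8, 30.5, 32.1, 28.9]

# Processing: organise and summarise the raw values
lowest = min(readings)
highest = max(readings)
average = sum(readings) / len(readings)

# Information: the summary placed in a context a reader can use
print(f"Temperatures ranged from {lowest} to {highest} degrees Celsius")
print(f"Average temperature: {average:.1f} degrees Celsius")
```

The list on its own says little; the range and the average are what a reader can act on.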
Programming and Data
• Tasks to undertake for data programming
o Data collection
o Data processing (wrangling)
o Data visualisation
o Train and apply algorithms from fields such as machine learning,
statistics, data mining, optimisation, image processing, etc.
• Programming
o The process of producing an executable computer program that
performs a specific task.
o The purpose is to find a sequence of instructions that automate the
implementation of the task for solving a given problem.
• Programming Language
o The source code of a program is written in one or more languages
that are intelligible to humans, rather than machine code, which is
directly executed by the CPU.
o Python
§ https://www.python.org/
2. Introduction to Development Environments
Source-code Editors
• A source-code editor, or programming text editor, is a fundamental
programming tool designed specifically for editing the source code of
computer programs.
• It highlights the syntax elements of your programs, and provides many
features that aid in your program development.
• Examples:
o Visual Studio Code [https://code.visualstudio.com/]
o Notepad++ (Windows only) [https://notepad-plus-plus.org/]
o Vim [https://www.vim.org/]
o Sublime Text (not open source) [https://www.sublimetext.com/]
o Atom [https://atom.io/]
o Emacs [https://www.gnu.org/software/emacs/]
o TextMate (Macs only) [https://macromates.com/]
o Jupyter
§ https://jupyter.org/
Integrated Development Environments (IDEs)
• An integrated development environment (IDE) is a software application
that provides comprehensive facilities to computer programmers for
software development.
• An IDE normally consists of at least a source code editor, build
automation tools and a debugger.
• Examples:
o Spyder [https://www.spyder-ide.org/]
o RStudio [https://rstudio.com]
o Eclipse [https://www.eclipse.org/]
o Microsoft Visual Studio [https://visualstudio.microsoft.com/vs/]
o Wing Python IDE [https://wingware.com]
Markdown / Markup Languages
• Markdown is a markup language that consists of a set of rules for adding
formatting elements to plain text documents
o Boldface, italics, headers, paragraphs, lists, code blocks, images,
etc.
o https://www.markdownguide.org/
• Invented by John Gruber
o The overriding design goal for Markdown’s formatting syntax is to
make it as readable as possible.
o The idea is that a Markdown-formatted document should be
publishable as-is, as plain text, without looking like it’s been
marked up with tags or formatting instructions
• Examples of other markup languages
o HTML; XML; LaTeX
Version Control Systems
• Version Control is a class of systems responsible for managing changes
to computer programs, documents, large websites, or other collections of
information.
• Version Control Systems (VCS) are software tools that help software
teams manage changes to source code over time.
o Undertakes the tedious task of keeping track of the changes to all
of a project’s files and who made them
o Allows users to recover any previous version at any given time
• Examples:
o Subversion [https://subversion.apache.org]
o Git
§ https://git-scm.com/
o GitHub (a hosting service for Git repositories)
§ https://github.com/
Package/Environment Manager
• A package manager, or package management system, is a collection of
software tools that automates the process of installing, upgrading,
configuring, and removing computer programs for a computer in a
consistent manner. It works with packages: distributions of software
and data in archive files.
• An environment manager creates and switches between isolated
environments, each with its own Python version and set of packages, so
that projects with conflicting requirements can coexist on one machine.
• Example:
o Anaconda
§ https://www.anaconda.com/
Installing Anaconda
• Go to the Anaconda website and download Anaconda Individual Edition
o https://www.anaconda.com/products/distribution
• Packages include
o conda
§ package management system
o pandas, scikit-learn, nltk
§ packages for data science
o Anaconda Navigator
§ a graphical user interface
o QtConsole
§ an interactive Python environment
o Spyder
§ a standard cross-platform IDE for Python
o Jupyter Notebook
§ an interactive web-browser based application for creating
and sharing code
Package Installer for Python (pip)
• pip is the de facto standard, and recommended, package manager for
Python. It is itself written in Python and is used to install and manage
software packages.
• It connects to an online repository of public packages, called the Python
Package Index.
• We use pip to install packages from the Python Package Index
• Examples
o pip install beautifulsoup4
o pip install -r requirements.txt
§ Install every package listed in a requirements file
o pip freeze
§ List all the installed packages with their versions
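From inside a notebook it is sometimes handy to check whether a package is already installed before running pip. The sketch below uses the standard-library importlib.metadata module (Python 3.8 or later); the helper name installed_version is hypothetical and chosen only for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# beautifulsoup4 is the package from the pip example above
print(installed_version("beautifulsoup4"))

# A made-up name that should not be installed anywhere
print(installed_version("no-such-package-hopefully"))
```

A result of None suggests the package still needs to be installed with pip.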
3. Introduction to Python
• An open-source, interpreted, high-level, object-oriented, general-purpose
language that is easy to download, write and read
• Named after the British comedy group Monty Python
• A simpler language that allows us to focus less on the language and more
on problem solving
• Includes many of the best parts of other languages
o Data structures
o Control structures
o Many packages for common tasks
Variables
• A variable is a named piece of memory whose value can change while the
program runs; a constant is a value that cannot change while the
program runs.
o Python has no built-in constants; by convention, upper-case names
mark values that should not be changed
• We use variable names to represent objects (numbers, data structures,
functions, etc.) in our program, to make our program more readable.
o All variable names must be one word; spaces are never allowed.
o Can only contain alphanumeric characters and underscores.
o Must start with a letter or the underscore character.
o Cannot begin with a number.
o Case-sensitive
o The standard style for most names in Python is lower_with_under
§ Lower case with separate words joined by an underscore
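The naming rules above can be tried out directly. The sketch below uses made-up variable names and the built-in string method isidentifier() to test whether a piece of text would be accepted as a name.

```python
# Valid names: letters, digits and underscores, not starting with a digit
student_count = 42
_total = 3.5
name2 = "Ada"

# Names are case-sensitive: these are two different variables
score = 10
Score = 20

# Python has no constant keyword; by convention, upper-case names
# signal values that should not be reassigned
MAX_ATTEMPTS = 3

# str.isidentifier() checks whether a string follows the naming rules
print("student_count".isidentifier())   # True
print("2nd_place".isidentifier())       # False: begins with a digit
print("my variable".isidentifier())     # False: contains a space
```

Note that isidentifier() only checks the spelling of a name; reserved words such as "for" pass the check but still cannot be used as variable names.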
Comments
• Comments are ignored by the computer but valued by other programmers.
• Header comments
o Appear at beginning of a program or a module
o Provide general information
• Step comments or in-line comments
o Appear throughout program
o Explain the purpose of a specific portion of code
• Comments are often delimited by
o // comment goes here
o /* comment goes here */
o # Python uses this
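A short sketch of how header, step and in-line comments might look in a Python program. The temperature conversion is only a placeholder task chosen for this illustration.

```python
# ------------------------------------------------------------
# Header comment: describes the program as a whole.
# Purpose: convert a temperature from Celsius to Fahrenheit.
# ------------------------------------------------------------

celsius = 25.0  # in-line comment: the input temperature

# Step comment: apply the conversion formula F = C * 9/5 + 32
fahrenheit = celsius * 9 / 5 + 32

print(fahrenheit)
```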
Python Operations
• Assignment Operator
o “=”
o Example:
§ a = 67890/12345
# compute the ratio, store the result in memory, and assign it to a
# the value of a is approximately 5.499392
§ b = a
# b now refers to the same value as a
• Output
o “print()”
o Example:
§ print('Hello World!') # print the string literal
§ print(a) # print the value of a
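The behaviour of assignment can be checked with a few lines. This is a minimal sketch that reuses the names a and b from the example above: assignment binds a name to a value, and rebinding one name leaves the other untouched.

```python
a = 67890 / 12345   # the right-hand side is evaluated first, then bound to a
b = a               # b now refers to the same float object as a

print(a)
print(b is a)       # True: both names refer to one object

a = 1               # rebinding a does not affect b
print(a)
print(b == 67890 / 12345)   # True: b still holds the original ratio
```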
Data Types in Python
• Declaration of variables in Python is not needed
o Use an assignment statement to create a variable
• Float
o Stores real numbers
o a = 4.6
o print(type(a))
• Integer
o Stores integers
o b = 10
o print(type(b))
• Conversion
o int(a) # convert float to int => 4
o float(b) # convert int to float => 10.0
• Basic arithmetic operators
o 3+2 # Addition => 5
o 5 - 2 # Subtraction => 3
o 5 * -2 # Multiplication => -10
o 5 / 2.5 # Division => 2.0
o 2**2 # Exponentiation => 4
o 10 % 3 # Modulus => 1
o 10 // 3 # Floor Division => 3
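The conversions and operators above have a few details worth trying out, in particular how int() truncates and how division behaves with negative numbers. A minimal sketch:

```python
a = 4.6
b = 10

print(type(a))      # <class 'float'>
print(type(b))      # <class 'int'>

# int() truncates toward zero; it does not round
print(int(a))       # 4
print(int(-4.6))    # -4
print(float(b))     # 10.0

# / always returns a float, // floors the quotient, % gives the remainder
print(10 / 2)       # 5.0
print(10 // 3)      # 3
print(10 % 3)       # 1
print(-10 // 3)     # -4, because floor division rounds down
print(2 ** 10)      # 1024
```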
• String
o Stores strings
o phrase = 'All models are wrong, but some are useful.'
o phrase[0:3] # slice characters at index 0 up to 2
=> 'All'
o phrase.find('models') # find the starting index of a word
=> 4
o phrase.find('right') # word not found
=> -1
o phrase.lower() # convert to lower case
=> 'all models are wrong, but some are useful.'
o phrase.upper() # convert to upper case
=> 'ALL MODELS ARE WRONG, BUT SOME ARE USEFUL.'
o phrase.split(',') # split the string into a list, based on a delimiter
=> ['All models are wrong', ' but some are useful.']
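The string operations above can be combined. The sketch below reuses the same phrase and adds a few related built-in features (negative slicing, strip() and the in operator) that are not in the list above.

```python
phrase = 'All models are wrong, but some are useful.'

# Slicing: the start index is included, the stop index is excluded
print(phrase[0:3])          # All
print(phrase[-7:])          # useful.

# Strings are immutable: methods return new strings
shout = phrase.upper()
print(phrase[0])            # A (the original is unchanged)

# split() followed by strip() removes the leading space after the comma
parts = [part.strip() for part in phrase.split(',')]
print(parts)                # ['All models are wrong', 'but some are useful.']

# The in operator is a readable alternative to find() != -1
print('models' in phrase)   # True
print('right' in phrase)    # False
```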
• Boolean
o Stores the logical (Boolean) values True or False
o k = 1 > 3
o print(k) # => False
o print(type(k)) # => <class 'bool'>
• Logical operators
o Conjunction (AND): “and”
o Disjunction (OR): “or”
o Negation (NOT): “not”
a   b   a and b   a or b   not a
T   T      T        T        F
T   F      F        T        F
F   T      F        T        T
F   F      F        F        T
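The truth table can be regenerated in code. This sketch uses itertools.product from the standard library to list every combination of a and b; the age check at the end is a made-up example of combining comparisons.

```python
from itertools import product

k = 1 > 3
print(k)            # False
print(type(k))      # <class 'bool'>

# Print one row of the truth table per combination of a and b
for a, b in product([True, False], repeat=2):
    print(a, b, a and b, a or b, not a)

# Comparisons can be combined with logical operators
age = 20
print(age >= 18 and age < 65)   # True
```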
Data Structures in Python
• Tuples
o Store an ordered collection of objects
o Immutable: elements cannot be modified, added or deleted
o Written with round brackets “( )”
§ tuple1 = ("apple", "banana", "cherry", "orange", "kiwi",
"melon", "mango")
§ tuple2 = ("Handsome Koh", 4896, 13.14, True)
o Accessing elements by indexing
§ tuple1[0] # first element => 'apple'
§ tuple1[-1] # last element => 'mango'
§ tuple1[2:5] # range of elements => ('cherry', 'orange',
'kiwi')
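A short sketch of what immutability means in practice, reusing tuple1 from above. Unpacking and the one-element tuple syntax are additional details not covered in the list.

```python
tuple1 = ("apple", "banana", "cherry", "orange", "kiwi", "melon", "mango")

print(len(tuple1))      # 7
print(tuple1[2:5])      # ('cherry', 'orange', 'kiwi')

# Attempting to modify a tuple raises TypeError
try:
    tuple1[0] = "avocado"
except TypeError as error:
    print("Cannot modify a tuple:", error)

# Unpacking assigns each element to its own name
name, number, ratio, flag = ("Handsome Koh", 4896, 13.14, True)
print(name, number)

# A one-element tuple needs a trailing comma
single = ("apple",)
print(type(single))     # <class 'tuple'>
```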
• Lists
o Store an ordered collection of objects; mutable
o Written with square brackets “[ ]”
§ list1 = ["apple", "banana", "cherry"]
§ list2 = ["Handsome Koh", 4896, 13.14, True]
o Changing elements
§ list1.append("orange") # add to the last position
=> ['apple', 'banana', 'cherry', 'orange']
§ list1[2] = "coconut" # modify the element at an index
=> ['apple', 'banana', 'coconut', 'orange']
§ list1.remove("apple") # delete an element by value
=> ['banana', 'coconut', 'orange']
§ list1.insert(2, "durian") # insert an element at a position
=> ['banana', 'coconut', 'durian', 'orange']
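The list operations above run in sequence as follows. The sketch also shows a common pitfall that follows from mutability: assigning a list to a second name does not copy it. The names alias and independent are chosen only for this illustration.

```python
list1 = ["apple", "banana", "cherry"]

list1.append("orange")
list1[2] = "coconut"
list1.remove("apple")       # removes the first matching value
list1.insert(2, "durian")
print(list1)                # ['banana', 'coconut', 'durian', 'orange']

# pop() removes by position and returns the removed element
last = list1.pop()
print(last)                 # orange

# Assignment does not copy a list: both names share one object
alias = list1
alias.append("kiwi")
print(list1)                # ['banana', 'coconut', 'durian', 'kiwi']

# Use copy() when an independent list is needed
independent = list1.copy()
independent.append("lime")
print(len(list1), len(independent))   # 4 5
```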
• Sets
o Store an unordered, unindexed collection of unique objects
o Written with curly brackets “{ }”
§ set1 = {"apple", "banana", "cherry"}
§ set2 = {"apple", "samsung"}
o Set operations
§ set1.union(set2) # union of both sets
=> {'apple', 'banana', 'cherry', 'samsung'}
§ set1.intersection(set2) # intersection of both sets
=> {'apple'}
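A brief sketch of the set properties listed above, reusing set1 and set2. The operators |, & and - are shorthand for the union, intersection and difference methods; sorted() is used where a predictable print order matters, since sets are unordered.

```python
set1 = {"apple", "banana", "cherry"}
set2 = {"apple", "samsung"}

# Duplicates are discarded when a set is built
print(len({"apple", "apple", "banana"}))    # 2

# Operators are shorthand for the set methods
print(set1 | set2 == set1.union(set2))      # True
print(set1 & set2)                          # {'apple'}
print(sorted(set1 - set2))                  # ['banana', 'cherry']

# Membership tests are a common use of a set
print("banana" in set1)     # True

# {} creates an empty dictionary; use set() for an empty set
print(type({}))             # <class 'dict'>
print(type(set()))          # <class 'set'>
```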
• Dictionaries
o Store a collection of key:value pairs; keys are unique, and insertion
order is preserved since Python 3.7
o Written with curly brackets “{ }” and “key:value” pairs
§ thisdict = {"brand": "Ford", "model": "Mustang",
"year": 1964}
o Accessing/modifying elements by key name
§ thisdict["model"]
=> 'Mustang'
§ thisdict["year"] = 2018
§ thisdict["color"] = "red"
=> {'brand': 'Ford', 'model': 'Mustang', 'year': 2018,
'color': 'red'}
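The dictionary example can be extended with a few other built-in operations. In this sketch the key "owner" is a made-up key used to show what happens when a key is missing.

```python
thisdict = {"brand": "Ford", "model": "Mustang", "year": 1964}

thisdict["year"] = 2018      # existing key: the value is replaced
thisdict["color"] = "red"    # new key: the pair is added
print(thisdict)

# Indexing a missing key raises KeyError; get() returns a default instead
print(thisdict.get("owner", "unknown"))     # unknown

# items() yields key/value pairs in insertion order (Python 3.7+)
for key, value in thisdict.items():
    print(key, "->", value)

# del removes a pair
del thisdict["color"]
print(list(thisdict))        # ['brand', 'model', 'year']
```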
4. Introduction to Jupyter Notebook
• Jupyter Notebook is a web-based interactive computing platform.
• The name combines “Julia” + “Python” + “R”, the three core languages it
supports
• Integrates code and output into a single document that contains:
o Live code, mathematical equations, visualisations,
explanatory/narrative text, interactive dashboards and other media
• Can be easily shared
o Notebook files have “.ipynb” extension
o Export to “.html” and “.pdf” extensions
• Launch “Jupyter Notebook” from “Anaconda Navigator”
• Create new notebook
o “File” → “New Notebook” → “Python 3”
• Exporting notebook
o “File” → “Download as” → “HTML (.html)”
o “File” → “Print Preview” (for PDF)
• Shutting Down Jupyter
o “File” → “Close and Halt”
o Quit
5. Exercises
1.301 Practice Exercises (Coursera)
• Refer to “1.301 part-1.html”
1.302 A bit more Python – our first downloadable notebook! (Coursera)
• Refer to “1.302 pythonPractice.html”
1.304 World’s Population
• Refer to “1.304 Topic 1 - Lab.html”
6. Practice Quiz
• Work on Practice Quiz 01 posted on Canvas.
Useful Resources
•
o http://