
Meeting 11 Basic Python 3

The document discusses reading and writing files in Python using the built-in open function and Pandas library. It explains how to open a file, read the contents line-by-line, write new text to a file, copy one file to another, and load data from files into Pandas DataFrames for analysis. Specific methods and functions covered include open, read, write, read_csv, read_excel, and DataFrame.

Uploaded by

johanr

Basic Python: Python for Data Science
MEETING XI

© IBM 2020
Reading Files with Open

In this section we will use Python’s built-in open function to create a file object and obtain the data from
a txt file.
We will use Python’s open function to get a file object.
We can apply a method to that object to read data from the file.

We can open the file Example1.txt as follows. We use the open function.
The first argument is the file path.
This is made up of the file name and the file directory.
The second parameter is the mode; common values used include
- 'r' for reading,
- 'w' for writing
- 'a' for appending.

We can now use the file object to obtain information about the file. We can use the data attribute name to get
the name of the file.

The result is a string that contains the name of the file.


We can see what mode the object is in using the data attribute mode.
An ‘r’ is shown representing read. You should always close the file object using the method close.
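
The steps above can be sketched as follows. The file contents and the temporary folder are assumptions made for illustration; the real Example1.txt is not shown in the slides.

```python
import os
import tempfile

# Hypothetical stand-in for Example1.txt, created in a temporary folder.
folder = tempfile.mkdtemp()
example1 = os.path.join(folder, "Example1.txt")
with open(example1, "w") as setup_file:
    setup_file.write("This is line 1\nThis is line 2\nThis is line 3")

file1 = open(example1, "r")   # first argument: file path, second: mode
file_name = file1.name        # the path that was passed to open
file_mode = file1.mode        # 'r' for reading
file1.close()                 # always close the file object when done
```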

Closing the file by hand may get tedious sometimes, so let's use the "with" statement.
Using a "with" statement to open the file is better practice because it automatically closes the file.
The code runs everything in the indented block, then closes the file. This code reads the file Example1.txt; we
can use the file object File1 inside the block.

The method read stores the contents of the file in the variable file_stuff as a string.
You can check whether the file is closed, but you cannot read from it outside the indented block.
You can still print the file content outside the block, because it is already stored in file_stuff.

We can print the file content.

When we examine the raw string, we see the characters \n (backslash n); this is so Python knows to
start a new line.
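
A minimal sketch of reading with a "with" statement is shown below. The sample text is an assumption; only the names File1 (written file1 here) and file_stuff come from the slides.

```python
import os
import tempfile

# Hypothetical stand-in for Example1.txt.
example1 = os.path.join(tempfile.mkdtemp(), "Example1.txt")
with open(example1, "w") as setup_file:
    setup_file.write("This is line 1\nThis is line 2\nThis is line 3")

with open(example1, "r") as file1:
    file_stuff = file1.read()   # whole file as one string

is_closed = file1.closed        # True: the block has ended
print(file_stuff)               # the string is still available here
print(repr(file_stuff))         # raw view, showing the \n characters
```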

We can output every line as an element in a list using the method readlines.
The first line corresponds to the first element in the list.
The second line corresponds to the second element in the list, and so on.
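
As a small illustration, assuming a three-line sample file, readlines returns one list element per line:

```python
import os
import tempfile

# Hypothetical stand-in for Example1.txt.
example1 = os.path.join(tempfile.mkdtemp(), "Example1.txt")
with open(example1, "w") as setup_file:
    setup_file.write("This is line 1\nThis is line 2\nThis is line 3")

with open(example1, "r") as file1:
    file_stuff = file1.readlines()   # list of lines, newlines included

first_line = file_stuff[0]           # "This is line 1\n"
second_line = file_stuff[1]          # "This is line 2\n"
```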

We can use the method readline to read the first line of the file.
If we run this command, it will store the first line in the variable file_stuff, then print the first line.
We can use the method readline twice: the first time it's called, it will save the first line in the variable file_stuff
and then print the first line.
The second time it's called it will save the second line in the variable file_stuff and then print the second line.

We can use a loop to print out each line individually as follows.
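
The sketch below shows readline called twice, followed by a loop over the file object. The sample file and the variable names first, second, and collected are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for Example1.txt.
example1 = os.path.join(tempfile.mkdtemp(), "Example1.txt")
with open(example1, "w") as setup_file:
    setup_file.write("This is line 1\nThis is line 2\nThis is line 3")

with open(example1, "r") as file1:
    first = file1.readline()    # "This is line 1\n"
    second = file1.readline()   # "This is line 2\n"

collected = []
with open(example1, "r") as file1:
    for line in file1:          # a file object yields one line per iteration
        collected.append(line)
        print(line, end="")
```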

Let's represent every character in a string as a grid.

We can specify the number of characters we would like to read as an argument to the method
readline; the method read accepts a character count in the same way. (The method readlines treats its argument as a size hint and always returns whole lines.)

When we use a 4 as an argument in the method readline, we print out the first four characters in the file.
Each time we call the method, we will progress through the text.
If we call the method with the argument 16, the first 16 characters are printed out and then the new line.
If we call the method a second time, the next five characters are printed out.
Finally, if we call the method the last time with the argument 9, the last 9 characters are printed out.
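
Here is a hedged sketch of reading a fixed number of characters. The sample text is assumed, so the exact pieces differ from the slide; note that readline never reads past the end of the current line, while read does.

```python
import os
import tempfile

# Hypothetical stand-in for Example1.txt.
example1 = os.path.join(tempfile.mkdtemp(), "Example1.txt")
with open(example1, "w") as setup_file:
    setup_file.write("This is line 1\nThis is line 2\nThis is line 3")

with open(example1, "r") as file1:
    first_four = file1.readline(4)     # "This"
    rest_of_line = file1.readline(16)  # " is line 1\n" (stops at the end of the line)
    next_five = file1.readline(5)      # "This "

with open(example1, "r") as file1:
    chunk = file1.read(16)             # "This is line 1\nT" (crosses the line break)
```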

Writing Files with Open

We can also write to files using the open function.


We will use Python's open function to get a file object and create a text file. We can then apply the method write to write
data to that file.
As a result, text will be written to the file.

We can create the file Example2.txt as follows:

We use the open function.

The first argument is the file path.
This is made up of the file name and the file directory;
if a file with that name already exists in the directory, it will be overwritten.
We set the mode parameter to 'w' for writing.

Finally, we have the file object. As before we use the "with" statement.
The code will run everything in the indented block, then close the file. We create the file object File1 with the
open function.
This creates a file Example2.txt in your directory. We use the method write to write data into the file.
The argument is the text we would like to write to the file.

If we use the write method successively, each time it's called, it will write to the file.
The first time it is called we will write "This is line A" followed by \n (backslash n) to represent a new line.
The second time we call the method it will write "This is line B".
Then it will close the file.
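
A minimal sketch of successive writes follows. The temporary folder and the trailing newline after line B are assumptions; the slides do not show the exact strings.

```python
import os
import tempfile

example2 = os.path.join(tempfile.mkdtemp(), "Example2.txt")

with open(example2, "w") as file1:       # 'w' creates or overwrites the file
    file1.write("This is line A\n")      # first call
    file1.write("This is line B\n")      # second call continues where the first ended

with open(example2, "r") as check:       # read the file back to confirm
    written = check.read()
```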

We can write each element in a list to a file. As before we use a "with" statement and the open function to create
a file.
The list 'lines' has three elements consisting of text.
We use a "for loop" to read each element of the list lines and pass it to the variable line.

The first iteration of the loop writes the first element of the list to the file Example2.txt.
The second iteration writes the second element of the list, and so on.
At the end of the loop the file will be closed.
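
The loop described above might look like this. The three strings in the list lines are assumed for illustration.

```python
import os
import tempfile

example2 = os.path.join(tempfile.mkdtemp(), "Example2.txt")
lines = ["This is line A\n", "This is line B\n", "This is line C\n"]

with open(example2, "w") as file1:
    for line in lines:       # one list element per iteration
        file1.write(line)

with open(example2, "r") as check:
    written = check.readlines()
```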

We can set the mode to append using a lowercase "a". If the file already exists, this will not create a new file but will add to the existing
one.
If we call the method write, it will add "This is line C" to the end of the existing file and then close the file.
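
As a sketch under the same assumptions as before (temporary folder, illustrative strings), appending keeps the earlier lines and adds a new one at the end:

```python
import os
import tempfile

example2 = os.path.join(tempfile.mkdtemp(), "Example2.txt")
with open(example2, "w") as file1:
    file1.write("This is line A\n")
    file1.write("This is line B\n")

with open(example2, "a") as file1:       # 'a' keeps the existing content
    file1.write("This is line C\n")

with open(example2, "r") as check:
    lines = check.readlines()
```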

We can copy one file to a new file as follows. First, we open the file Example1.txt and interact with it via the file
object readfile. Then we create a new file Example3.txt and use the file object writefile to interact with it.
The "for loop" takes a line from the file object readfile and stores it in the file Example3.txt using the file object
writefile.
The first iteration copies the first line.
The second iteration copies the second line, and so on until the end of the file is reached; then both files are closed.
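
The copy can be sketched with two nested "with" statements. The sample contents of Example1.txt are assumed.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
example1 = os.path.join(folder, "Example1.txt")
example3 = os.path.join(folder, "Example3.txt")
with open(example1, "w") as setup_file:   # hypothetical source file
    setup_file.write("This is line 1\nThis is line 2\nThis is line 3")

with open(example1, "r") as readfile:
    with open(example3, "w") as writefile:
        for line in readfile:             # copy one line per iteration
            writefile.write(line)

with open(example1) as a, open(example3) as b:
    original, copy = a.read(), b.read()
```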

Loading Data with Pandas

Dependencies or libraries are pre-written code to help solve problems. In this section we introduce pandas, a popular library for
data analysis.
We can import the library or a dependency like pandas using the following command.
We start with the import command followed by the name of the library.
We now have access to a large number of pre-built classes and functions.
This assumes the library is installed. In our lab environment, all the necessary libraries are installed.

Let's say we would like to load a csv file using the pandas built-in function read_csv.
A csv is a typical file type used to store data.
We simply type the word pandas, then a dot and the name of the function with all the inputs.

Typing pandas all the time may get tedious.


We can use the "as" statement to shorten the name of the library; in this case we use the standard abbreviation
pd.
Now we type pd and a dot followed by the name of the function we would like to use, in this case, read_csv.
We are not limited to the abbreviation pd. In this case, we use the term, banana.
We will stick with pd. Let’s examine this code more in-depth.

One way pandas allows you to work with data is with a data frame. Let's go over the process to go from a csv file
to a data frame.
This variable stores the path of the csv. It is used as an argument to the read_csv function.
The result is stored to the variable df; this is short for “dataframe.” Now that we have the data in a dataframe,
we can work with it.
We can use the method head to examine the first 5 rows of a dataframe.
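
A minimal sketch of going from a csv file to a dataframe is given below. The file albums.csv and its three rows are assumptions for illustration; the course data set is not reproduced in the slides.

```python
import os
import tempfile
import pandas as pd   # "as pd" is the standard abbreviation

# Hypothetical sample data written to a temporary csv file.
csv_path = os.path.join(tempfile.mkdtemp(), "albums.csv")
with open(csv_path, "w") as f:
    f.write("Artist,Album,Released\n"
            "Michael Jackson,Thriller,1982\n"
            "AC/DC,Back in Black,1980\n"
            "Pink Floyd,The Dark Side of the Moon,1973\n")

df = pd.read_csv(csv_path)   # the path is the argument to read_csv
first_rows = df.head()       # first 5 rows (here the file only has 3)
print(first_rows)
```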

The process for loading an Excel file is similar. We use the path of the Excel file.
The function is read_excel. The result is a dataframe. A dataframe is composed of rows and columns.
We can create a dataframe out of a dictionary. The keys correspond to the column labels.
The values are lists holding the entries of each column, one per row. We then cast the dictionary to a dataframe using the function
DataFrame.

We can see the direct correspondence between the dictionary and the table. The keys correspond to the table headers. The values
are lists holding the entries of each column, one per row.
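
The dictionary-to-dataframe step could be sketched as follows. The dictionary name songs and its entries are assumed sample values.

```python
import pandas as pd

# Keys become column labels; each list holds one column's entries.
songs = {
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd"],
    "Album": ["Thriller", "Back in Black", "The Dark Side of the Moon"],
    "Released": [1982, 1980, 1973],
}

df = pd.DataFrame(songs)   # cast the dictionary to a dataframe
print(df)
```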

We can create a new dataframe consisting of one column.


We just put the dataframe name, in this case "df" and the name of the column header enclosed in double
brackets.
The result is a new dataframe comprised of the original column.

You can do the same thing for multiple columns.
We just put the dataframe name, in this case "df" and the name of the multiple column headers enclosed in
double brackets.
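
For illustration, assuming the small sample dataframe below, double brackets return a new dataframe with the chosen columns:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd"],
    "Album": ["Thriller", "Back in Black", "The Dark Side of the Moon"],
    "Released": [1982, 1980, 1973],
})

x = df[["Released"]]             # one column, still a dataframe
y = df[["Artist", "Released"]]   # several columns
```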

The result is a new dataframe comprised of the specified columns. One way to access individual elements is the ix
method. The ix method has since been removed from pandas; iloc (access by integer position) and loc (access by label) take its place.
You can access the 1st row and 1st column.
You can access the 2nd row and 1st column.
You can access the 1st row and 3rd column as follows.

You can access the 2nd row, 3rd column as follows. You can use the name of the column as well.
You can access the 1st row of the column named 'Artist’ as follows.
Similarly, you can access the 2nd row of the column named ‘Artist.’
You can access the 1st row of the column named 'Released’ as follows.
Finally, you can access the 2nd row of the column named 'Released.’
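
The element lookups above can be sketched with iloc and loc, which current pandas versions provide in place of ix. The sample dataframe is assumed.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd"],
    "Album": ["Thriller", "Back in Black", "The Dark Side of the Moon"],
    "Released": [1982, 1980, 1973],
})

# By integer position: [row, column], both starting at 0.
row1_col1 = df.iloc[0, 0]   # 1st row, 1st column
row2_col1 = df.iloc[1, 0]   # 2nd row, 1st column
row1_col3 = df.iloc[0, 2]   # 1st row, 3rd column
row2_col3 = df.iloc[1, 2]   # 2nd row, 3rd column

# By row label and column name.
artist_1 = df.loc[0, "Artist"]
released_2 = df.loc[1, "Released"]
```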

You can also slice dataframes and assign the values to a new dataframe.
We assign the first two rows and the first three columns to the variable z.
The result is a dataframe comprised of the selected rows and columns.

You can also slice dataframes using the column names and assign the values to a new dataframe.
The code assigns the first three rows and all columns between the columns named 'Artist' and 'Released', inclusive, to the variable z.
The result is a new dataframe z with the corresponding values.
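
Both kinds of slicing are sketched below on assumed sample data. The name z_by_name is introduced here only to keep the two results apart. Note that iloc excludes the end of the slice, while loc includes it.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd", "Whitney Houston"],
    "Album": ["Thriller", "Back in Black", "The Dark Side of the Moon", "The Bodyguard"],
    "Released": [1982, 1980, 1973, 1992],
})

z = df.iloc[0:2, 0:3]                          # first two rows, first three columns
z_by_name = df.loc[0:2, "Artist":"Released"]   # rows 0 to 2 and the named columns, inclusive
```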

Working and Saving Data with Pandas

When we have a dataframe we can work with the data and save the results in other formats.
Consider the stack of 13 blocks of different colors.
We can see there are three unique colors.
Let’s say you would like to find out how many unique elements are in a column of a dataframe.

This may be much more difficult because instead of 13 elements you may have millions.
pandas has the method unique to determine the unique elements in a column of a dataframe.
Let’s say we would like to determine the unique year of the albums in the data set.
We enter the name of the dataframe, then enter the name of the column ‘Released’ within brackets.
Then we apply the method unique. The result is all of the unique elements in the column ‘Released.’
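
A short sketch, using an assumed column of release years with repeats:

```python
import pandas as pd

# Hypothetical column with repeated values.
df = pd.DataFrame({"Released": [1982, 1980, 1973, 1980, 1982]})

years = df["Released"].unique()   # unique values, in order of first appearance
print(years)
```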

Let's say we would like to create a new dataframe consisting of albums from the 1980s and after.
We can look at the column 'Released' for albums made after 1979, then select the corresponding rows.
We can accomplish this within one line of code in pandas, but let's break up the steps.

We can apply the inequality operators to an entire dataframe column in pandas.
For our case, we simply specify the column 'Released' and the inequality for the albums after 1979.
The result is a series of Boolean values.
Each value is True where the condition holds and False otherwise.

We can select the matching rows in one line; we simply use the dataframe's name, and in square brackets
we place the previously mentioned inequality and assign the result to the variable df1.
We now have a new dataframe, where each album was released after 1979.
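
The two steps can be sketched as follows on assumed sample data. The intermediate name mask is introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd"],
    "Album": ["Thriller", "Back in Black", "The Dark Side of the Moon"],
    "Released": [1982, 1980, 1973],
})

mask = df["Released"] >= 1980   # series of Boolean values, one per row
df1 = df[mask]                  # keep only the rows where mask is True
```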

We can save the new dataframe using the method to_csv.


The argument is the name of the csv file.
Make sure you include a dot csv extension. There are other functions to save the dataframe in other formats.
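
Saving could look like the sketch below. The output name new_albums.csv and the sample data are assumptions; index=False is an optional choice made here so the row labels are not written as an extra column.

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "Artist": ["Michael Jackson", "AC/DC", "Pink Floyd"],
    "Album": ["Thriller", "Back in Black", "The Dark Side of the Moon"],
    "Released": [1982, 1980, 1973],
})
df1 = df[df["Released"] >= 1980]

out_path = os.path.join(tempfile.mkdtemp(), "new_albums.csv")
df1.to_csv(out_path, index=False)   # the file name carries the .csv extension

reloaded = pd.read_csv(out_path)    # read it back to confirm
```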

1D Numpy Array

Numpy is a library for scientific computing. It has many useful functions, along with other
advantages such as speed and memory efficiency.
A Python list is a container that allows you to store and access data. Each element is associated with an
index. We can access each element using a square bracket, as follows.

Numpy Array

A "numpy" array or "ndarray" is similar to a list. It's usually fixed in size and each element is of the same type, in
this case, integers.
We can cast a list to a numpy array by first importing numpy. We then cast the list as follows. We can access the
data via an index.

As with a list, we can access each element with an integer and a square bracket.
The value of 'a' is stored as follows. If we check the type of the array we get "numpy.ndarray".
As numpy arrays contain data of the same type, we can use the attribute "dtype" to obtain the data-type of the
array’s elements. In this case a 64-bit integer.

Let's review some basic array attributes using the array 'a'. The attribute size is the number of elements in
the array.
As there are 5 elements, the result is 5. The next two attributes will make more sense when we get to higher
dimensions, but let's review them. The attribute "ndim” represents the number of array dimensions or the
rank of the array, in this case one.
The attribute "shape” is a tuple of integers indicating the size of the array in each dimension.

We can create a numpy array with real numbers. When we check the type of the array, we get numpy.ndarray.
If we examine the attribute "dtype", we see float64, as the elements are not integers. There are many other
attributes; check out numpy.org.
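
The attributes reviewed so far can be sketched like this. The element values of a and b are assumed; the integer dtype is typically int64 but can depend on the platform.

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4])   # cast a list to a numpy array
first = a[0]                    # access by index, as with a list
kind = type(a)                  # numpy.ndarray
int_type = a.dtype              # an integer type, typically int64

size = a.size                   # number of elements: 5
ndim = a.ndim                   # number of dimensions: 1
shape = a.shape                 # size in each dimension: (5,)

b = np.array([3.1, 11.02, 6.2, 213.2, 5.2])
float_type = b.dtype            # float64
```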

Let's review some indexing and slicing methods. We can change the first element of the array to 100, as
follows.
The array's first value is now 100. We can change the fifth element of the array as follows. The fifth element is
now 0.

Like lists and tuples, we can slice a numpy array.

The elements of the array correspond to the following index. We can select the elements from 1 to 3 and assign
them to a new numpy array 'd' as follows.
The elements in 'd' correspond to the index. As with lists, we do not count the element corresponding to the last
index.
We can assign the corresponding indexes to new values as follows. The array 'c' now has new values. See the
labs or numpy.org for more examples of what you can do with numpy.
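
A sketch of these indexing and slicing steps follows. The starting values of c are assumed, and d_values is a helper name introduced here to record the slice before c is changed again.

```python
import numpy as np

c = np.array([20, 1, 2, 3, 4])

c[0] = 100                # change the first element
c[4] = 0                  # change the fifth element

d = c[1:4]                # elements at index 1, 2, 3 (index 4 is not included)
d_values = d.tolist()     # [1, 2, 3]

c[3:5] = 300, 400         # assign new values to index 3 and 4
```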

Numpy makes it easier to do many operations that are commonly performed in data science.
These operations are usually computationally faster and require less memory in numpy than in
regular Python.
Let's review some of these operations on one-dimensional arrays.
We will look at many of the operations in the context of Euclidean vectors to make things more interesting.

Numpy Array - Vector

Vector addition is a widely used operation in data science. Consider the vector 'u' with two elements; the
elements are distinguished by the different colors. Similarly, consider
the vector 'v' with two components. In vector addition, we create a new vector in this case 'z'.

The first component of 'z' is the addition of the first component of vectors 'u' and 'v'. Similarly, the
second component is the sum of the second components of 'u' and 'v'.
This new vector 'z' is now a linear combination of the vector 'u' and 'v'.

Representing vector addition with line segments or arrows is helpful. The first vector is represented in red; the
vector points in the direction given by its two components.
The first component of the vector is one; as a result, the arrow is offset one unit from the origin in the horizontal
direction.
The second component is zero. We represent this component in the vertical direction; as this component is zero, the
vector does not point in the vertical direction.

We represent the second vector in blue. The first component is zero, therefore, the arrow does not point in the
horizontal direction. The second component is one.
As a result, the vector points in the vertical direction one unit. When we add the vector 'u' and 'v' we get the new
vector 'z’.

We add the first components; this corresponds to the horizontal direction.

We also add the second components. It's helpful to use the tip-to-tail method when adding vectors, placing
the tail of vector 'v' on the tip of vector 'u'.

The new vector 'z' is constructed by connecting the tail of the first vector 'u' with the tip of the second vector 'v'.

We can also perform vector addition with one line of numpy code. It would require multiple lines to perform
vector addition on two lists, as shown on the right side of the screen.
In addition, the numpy code will run much faster. This is important if you have lots of data. We can also perform
vector subtraction by changing the addition sign to a subtraction sign.
It would require multiple lines to perform vector subtraction on two lists, as shown on the right side of the
screen.

The following 3 lines of code will add the two lists and place the result in the list 'z'.
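
The comparison can be sketched as follows, using the vectors u = (1, 0) and v = (0, 1) described above. The names w, u_list, v_list, and z_list are introduced here for illustration.

```python
import numpy as np

u = np.array([1, 0])
v = np.array([0, 1])

z = u + v                 # vector addition in one line: [1, 1]
w = u - v                 # vector subtraction: [1, -1]

# The same addition with plain lists needs a loop.
u_list, v_list = [1, 0], [0, 1]
z_list = []
for n, m in zip(u_list, v_list):
    z_list.append(n + m)
```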

Vector multiplication with a scalar is another commonly performed operation. Consider the vector 'y'; each
component is specified by a different color.
We simply multiply the vector by a scalar value, in this case, 2. Each component of the vector is multiplied by 2;
in this case, each component is doubled.

We can use line segments or arrows to visualize what's going on. The original vector y is in purple.

After multiplying it by a scalar value of 2, the vector is stretched by a factor of two, as shown in red. The new vector
is twice as long in each direction.

Vector multiplication with a scalar requires only one line of code using Numpy. Performing the same task with
Python lists would require multiple lines, as shown on the right side of the screen.
In addition, the operation would also be much slower.
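
A minimal sketch, with assumed components for y:

```python
import numpy as np

y = np.array([1, 2])
z = 2 * y                 # every component is doubled: [2, 4]

# The same task with a plain list.
y_list = [1, 2]
z_list = []
for n in y_list:
    z_list.append(2 * n)
```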

Hadamard

Hadamard product is another widely used operation in data science. Consider the following two vectors 'u' and
'v’.
The Hadamard product of 'u' and 'v' is a new vector 'z'. The first component of 'z' is the product of the first
element of 'u' and 'v'. Similarly, the second component is the product of the second element of 'u' and 'v'. The
resultant vector consists of the entry wise product of 'u' and 'v'.

We can also perform Hadamard product with 1 line of code in Numpy.


It would require multiple lines to perform Hadamard product on two lists as shown on the right side of the screen.
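
As a sketch with assumed values for u and v, the Hadamard product is the ordinary * operator on numpy arrays:

```python
import numpy as np

u = np.array([1, 2])
v = np.array([3, 2])

z = u * v                 # entrywise product: [1*3, 2*2] = [3, 4]

# The same product with plain lists.
z_list = []
for n, m in zip([1, 2], [3, 2]):
    z_list.append(n * m)
```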

Dot

The dot product is another widely used operation in data science. Consider the vectors 'u' and 'v'.
The dot product is a single number given by the following term. We
multiply the first components of 'v' and 'u'.
We then multiply the second components and add the results together.
The result is a number that represents how similar the two vectors are.

We can also perform the dot product using the numpy function "dot" and assign it to the variable "result" as follows.

Consider the array 'u'. The array contains the following elements. If we add a scalar value to the array, numpy
will add that value to each element. This property is known as broadcasting.
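
Both ideas can be sketched with assumed values for u and v; the name shifted is introduced here for illustration.

```python
import numpy as np

u = np.array([1, 2])
v = np.array([3, 2])

result = np.dot(u, v)     # 1*3 + 2*2 = 7

shifted = u + 1           # broadcasting: 1 is added to each element, giving [2, 3]
```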

Universal Function

A universal function is a function that operates on ndarrays.


We can apply a universal function to a Numpy array.
Consider the array 'a'. We can calculate the mean or average value of all the elements in 'a' using the method
"mean". In this case the result is zero.

We can use numpy to create functions that map numpy arrays to new numpy arrays.
Let's implement some code on the left side of the screen and use the right side of the screen to demonstrate
what's going on. We can access the value of pi in Numpy as follows.
We can create the following numpy array in Radians.

This array corresponds to the following vector. We can apply the function "sine" to the array x and assign
the values to the array y; this applies the sine function to each element in the array.
This corresponds to applying the sine function to each component of the vector.
The result is a new array y where each value corresponds to a sine function being applied to each element
in the array x.
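
The mean and the sine example might be sketched as follows. The values in a and x are assumptions chosen so that the mean is zero and the angles are simple multiples of pi.

```python
import numpy as np

a = np.array([1, -1, 1, -1])
mean_a = a.mean()                      # average of all elements: 0.0

x = np.array([0, np.pi / 2, np.pi])    # angles in radians
y = np.sin(x)                          # sine applied to each element

# y is approximately [0, 1, 0]; the last entry is a tiny
# floating-point number rather than an exact zero.
print(mean_a, y)
```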

Linspace
A useful function for plotting mathematical functions is "linspace".
Linspace returns evenly spaced numbers over a specified interval. We specify the starting point of the
sequence and the ending point of the sequence. The parameter "num" indicates the number of samples to generate, in this
case, 5. The space between samples is 1.
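
For example, assuming the interval runs from -2 to 2 (the slide's endpoints are not shown in the text), five samples are spaced one unit apart:

```python
import numpy as np

samples = np.linspace(-2, 2, num=5)   # start, end (included), number of samples
spacing = samples[1] - samples[0]     # distance between neighbouring samples
print(samples)
```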

We can use the function linspace to generate 100 evenly spaced samples from the interval 0 to 2 pi.
We can use the Numpy function sine to map the array x to a new array y.
We can import the library pyplot as plt to help us plot the function. As we are using a Jupyter notebook, we use
the command "%matplotlib inline" to display the plot.
The following command plots a graph. The first input corresponds to the values for the horizontal or x-axis.
The second input corresponds to the values for the vertical or y-axis.
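
A sketch of the plotting steps as a plain script is given below. It assumes matplotlib is installed and selects the non-interactive "Agg" backend so it runs without a display; in a Jupyter notebook you would use %matplotlib inline instead. The name plotted is introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")                 # assumption: no display available
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)    # 100 samples from 0 to 2 pi
y = np.sin(x)                         # map x to a new array y

plotted = plt.plot(x, y)              # x values first, then y values
```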

2D Arrays

This section will focus only on 2D arrays, but you can use Numpy to build arrays of much higher dimensions.
Consider the list "a".

The list contains three nested lists each of equal size. We can cast the list to a numpy array as follows.
It is helpful to visualize the Numpy array as a rectangular array; each nested list corresponds to a different row of
the matrix.

2D Arrays - ndim

We can use the attribute "ndim" to obtain the number of axes or dimensions, referred to as the rank.
The term rank here does not refer to the number of linearly independent columns, as it does for a matrix. It's useful to think of
"ndim" as the number of nested lists.

As with the 1D array, the attribute "shape" returns a tuple. It's helpful to use the rectangular representation as
well.
The first element in the tuple corresponds to the number of nested lists contained in the original list or the
number of rows in the rectangular representation, in this case 3.
The second element corresponds to the size of each of the nested lists or the number of columns in the
rectangular array, also 3.

The convention is to label axis 0 and axis 1 as follows.

We can also use the attribute size to get the size of the array. We see there are three rows and three columns.
Multiplying the number of columns and rows together we get the total number of elements, in this case 9.
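
These attributes can be sketched on a 3 by 3 array. The element values are assumed, chosen to agree with the values 23 and 11 quoted in the indexing slides.

```python
import numpy as np

a = [[11, 12, 13], [21, 22, 23], [31, 32, 33]]   # three nested lists of equal size
A = np.array(a)                                  # cast the list to a numpy array

ndim = A.ndim      # number of axes: 2
shape = A.shape    # (rows, columns): (3, 3)
size = A.size      # total number of elements: 3 * 3 = 9
```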

We can use square brackets to access the different elements of the array.
The following image demonstrates the relationship between the indexing conventions for the list-like
representation.
The index in the first bracket corresponds to the different nested lists, each a different color.
The second bracket corresponds to the index of a particular element within the nested list.

Using the rectangular representation, the first index corresponds to the row index.
The second index corresponds to the column index.
Consider the following syntax.

This index corresponds to the second row, and this index to the third column. The value is 23.

Consider this example. The first index corresponds to the first row, and the second index corresponds to the first column; the value is 11.
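
The two lookups can be sketched as follows. The array values are assumed so that they reproduce the 23 and 11 mentioned above.

```python
import numpy as np

A = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33]])

second_row_third_col = A[1][2]   # row index 1, column index 2
first_row_first_col = A[0][0]    # row index 0, column index 0

same_element = A[1, 2]           # a single bracket with a comma works too
```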

2D Arrays - Slicing
We can also use slicing in numpy arrays.
The first index corresponds to the first row.
The second index accesses the first two columns.

Consider this example.
The first index corresponds to the last two rows.
The second index accesses the last column.
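
Both slicing examples are sketched below on the same assumed 3 by 3 array; the result names are introduced here for illustration.

```python
import numpy as np

A = np.array([[11, 12, 13], [21, 22, 23], [31, 32, 33]])

first_row_two_cols = A[0, 0:2]    # first row, first two columns
last_rows_last_col = A[1:3, 2]    # last two rows, last column
```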
We can also add arrays; the process is identical to matrix addition.
Consider the matrix X; each element is coloured differently.
Consider the matrix Y; similarly, each element is coloured differently.
We can add the matrices.

In numpy, we define the array X, then we define the second array Y, and we add the arrays.
The result is identical to matrix addition.
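
A minimal sketch, with assumed 2 by 2 values for X and Y:

```python
import numpy as np

X = np.array([[1, 0], [0, 1]])
Y = np.array([[2, 1], [1, 2]])

Z = X + Y      # elements in the same position are added
print(Z)
```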

2D Arrays - Multiply

Multiplying a Numpy array by a scalar is identical to multiplying a matrix by a scalar.

Consider the matrix Y. If we multiply the matrix by the scalar 2, we simply multiply every element in the
matrix by 2.
The result is a new matrix of the same size where each element is multiplied by two.

Consider the array Y; we first define the array.
We multiply the array by a scalar as follows and assign it to the variable Z.
The result is a new array where each element is multiplied by two.

Multiplication of two arrays corresponds to an element-wise product or Hadamard product.

Consider array X and array Y.

The Hadamard product corresponds to multiplying each of the elements in the same position, i.e., multiplying elements
contained in the same color boxes together.
The result is a new matrix that is the same size as matrix Y or X.
Each element in this new matrix is the product of the corresponding elements in X and Y.

We can find the product of the two arrays X and Y in one line and assign it to the variable Z as follows. The result
is identical to the Hadamard product.
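
Both kinds of multiplication can be sketched with assumed 2 by 2 arrays. The names Z_scaled and Z_hadamard are introduced here to keep the two results apart.

```python
import numpy as np

X = np.array([[1, 0], [0, 1]])
Y = np.array([[2, 1], [1, 2]])

Z_scaled = 2 * Y       # every element of Y multiplied by 2
Z_hadamard = X * Y     # elements in the same position multiplied together
```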

We can also perform matrix multiplication with numpy arrays. Matrix multiplication is a little more complex, but
let's provide a basic overview.
Consider the matrix "A", where each row is a different colour.
Also, consider the matrix "B", where each column is a different colour.

In linear algebra, before we can multiply matrix "A" by matrix "B" we must make sure that the number of
columns in matrix "A", in this case 3, is equal to the number of rows in matrix "B", in this case 3.
To obtain the entry in the i-th row and j-th column of the new matrix, we take the dot product of
the i-th row of "A" with the j-th column of "B".
For the 1st row and 1st column, we take the dot product of the 1st row of "A" with the 1st column of "B" as
follows. The result is 0.

For the first row and the second column of the new matrix we take the dot product of the first row of the
matrix "A" but this time we use the second column of matrix "B"; the result is 2.

For the second row and the first column of the new matrix we take the dot product of the second row of the
matrix "A" with the first column of matrix "B"; the result is 0.

Finally, for the second row and the second column of the new matrix we take the dot product
of the second row of the matrix "A" with the second column of matrix "B"; the result is 2.

In numpy we can define the Numpy arrays "A" and "B".

We can perform matrix multiplication and assign the result to array "C".
The array "C" corresponds to the matrix multiplication of arrays "A" and "B".
There is a lot more you can do with numpy.
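
The multiplication can be sketched with np.dot. The entries of A and B are assumptions, chosen so that the four dot products come out as 0, 2, 0, and 2, matching the results stated above.

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 1]])          # 2 rows, 3 columns
B = np.array([[1, 1],
              [1, 1],
              [-1, 1]])            # 3 rows, 2 columns

C = np.dot(A, B)                   # entry (i, j) is row i of A dotted with column j of B

# First row, first column by hand: 0*1 + 1*1 + 1*(-1) = 0
first_entry = np.dot(A[0, :], B[:, 0])
print(C)
```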

LAB

Please continue to the lab:

- Module 4 - Working with Data in Python:
https://courses.cognitiveclass.ai/courses/course-v1:Cognitiveclass+PY0101EN+v2/courseware/c6143d9ff5764057a91e53fa8a3a6dff/00c2cd249d94403688ae979661e8eebf/
- Module 5 - Working with Numpy Arrays:
https://courses.cognitiveclass.ai/courses/course-v1:Cognitiveclass+PY0101EN+v2/courseware/c6143d9ff5764057a91e53fa8a3a6dff/00c2cd249d94403688ae979661e8eebf/
