Data Type in Python
Second Course:
Importing data in Python
In this course we will learn to import data from a large variety of sources,
for example:
(i) flat files such as .txt and .csv files;
(ii) files native to other software, such as Excel spreadsheets and Stata, SAS and
MATLAB files.
First off, we're going to learn how to import basic text files,
which we can broadly classify into 2 types of files:
1. those containing plain text,
such as the opening of Mark Twain's novel The
Adventures of Huckleberry Finn;
2. those containing table data, such as titanic.csv, in which each row
is a unique passenger and each
column is a characteristic or feature, such
as gender, cabin and 'survived or not'. The
latter is known as a flat file.
To read a plain text file, first open a connection to the file. To
do so,
you assign the filename to a
variable as a string, then pass the
filename to the function
open, along with the
argument mode='r', which opens the file for reading only.
Next, assign the text from the file to a variable text by calling the read method
on the file object, and close the connection with the close method.
Now print text and check it.
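The steps above can be sketched as follows. This is only a minimal illustration: the file name example.txt and its contents are hypothetical stand-ins, created inside the snippet so that it runs on its own.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical sample file, created only so the sketch is self-contained.
filename = os.path.join(tempfile.mkdtemp(), "example.txt")
Path(filename).write_text("First line of text.\nSecond line of text.\n")

# Open a connection to the file in read-only mode.
file = open(filename, mode="r")

# Read the whole file into a single string.
text = file.read()

# Close the connection once the text has been read.
file.close()

# Print and check the text.
print(text)
```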
It is good to know how to write
data to a file, but we will not do so
in this course.
You can avoid having to
close the connection to the file by
using a with statement, also known as a context manager.
In a statement such as "with open(filename, 'r') as file:" you are 'binding' a variable in the
context manager construct; while still within this construct, the variable file is bound to
open(filename, 'r'). It is best practice to use the with statement, as you never have to
concern yourself with closing the file.
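A minimal sketch of the with statement, assuming a hypothetical sample file example.txt that the snippet creates for itself:

```python
import os
import tempfile
from pathlib import Path

# Hypothetical sample file, created only so the sketch is self-contained.
filename = os.path.join(tempfile.mkdtemp(), "example.txt")
Path(filename).write_text("First line of text.\nSecond line of text.\n")

# Inside the with block, the name `file` is bound to the open file object.
with open(filename, "r") as file:
    text = file.read()

# On leaving the block, the file has been closed automatically.
print(file.closed)  # prints True
print(text)
```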
The importance of flat files in data science
Flat Files:
Flat files are basic text files containing
records, that is, table data. In titanic.csv, each
row or record is a unique passenger on board
and each column is a feature or attribute, such
as name, gender and cabin.
It is also essential to note that a flat file can
have a header, that is, a first row that names each
column, as in titanic.csv.
It is important to know whether or not your
file has a header, as this may alter how you
import the data.
File extension:
Flat files often have the extension .csv, short for
comma-separated values: the values in each row
are separated by commas. Another common
extension for a flat file is .txt, which means a text file.
Values in flat files can be separated by
characters or sequences of characters
other than commas, such as a tab, and
the character or characters in question are
called a delimiter.
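To make the ideas of header and delimiter concrete, here is a small sketch using Python's standard csv module. The tab-delimited content below is invented for illustration and is not taken from the Titanic data.

```python
import csv
import io

# Hypothetical tab-delimited content with a header row.
raw = "name\tage\tcabin\nAlice\t29\tC85\nBob\t41\tE46\n"

# Tell the reader which delimiter separates the values in each row.
reader = csv.reader(io.StringIO(raw), delimiter="\t")

header = next(reader)  # the first row holds the column names
rows = list(reader)    # the remaining rows hold the records

print(header)
print(rows[0])
```

Note that every value comes back as a string; converting columns to numbers is a separate step.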
One example of a tab-delimited file
holds the famous MNIST digit recognition
images, where
each row contains the pixel values of a
given image. Note that all fields in the
MNIST data are numeric, while
titanic.csv also contains strings.
How you import a flat file depends on its contents.
If it consists entirely of numbers and
we want to store the data as a NumPy array,
we could use NumPy.
If, instead, we want to store the data in a
DataFrame, we could use pandas.
In the rest of this chapter, you'll learn
how to import flat files that contain only
numerical data, such as the MNIST
data, and flat files that contain
both numerical data and strings, such as
titanic.csv.
Importing flat files using NumPy
What if you want to import a flat file and assign it to a variable? If all the data are numerical,
you can use the package NumPy to import the data as a NumPy array.
Why NumPy?
NumPy arrays are often essential for other packages, such as
- scikit-learn, a popular machine learning package for Python.
NumPy itself has a number of built-in functions that make it far easier and more efficient
for us to import data as arrays.
Enter the NumPy functions
- loadtxt and
- genfromtxt
To use either of these we
first need to import
NumPy.
We then call loadtxt and
pass it the filename as the
first argument, along with
the delimiter as the keyword
argument delimiter, for example delimiter=','.
Note that the default
delimiter is any whitespace,
so we'll usually
need to specify it explicitly.
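A minimal sketch of a basic loadtxt call. The file digits_sample.csv and its values are hypothetical and are created inside the snippet so that it runs on its own.

```python
import os
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical comma-delimited file containing only numbers.
filename = os.path.join(tempfile.mkdtemp(), "digits_sample.csv")
Path(filename).write_text("1,0,255\n7,128,64\n")

# The delimiter is passed by keyword; the default would split on whitespace.
data = np.loadtxt(filename, delimiter=",")

print(type(data))
print(data.shape)
```

By default loadtxt converts every value to a float, so the result is a two-dimensional array of floats.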
If you want only certain columns, for example the first and third, set the argument usecols to a list of their indices, here the ints 0 and 2.
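The usecols argument can be sketched as below. The sample file is hypothetical, and skiprows=1 is an addition not discussed above: it is assumed here because the sample file starts with a header row of strings, which loadtxt would otherwise fail to convert to floats.

```python
import os
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical comma-delimited file with a header row.
filename = os.path.join(tempfile.mkdtemp(), "digits_with_header.csv")
Path(filename).write_text("label,pixel0,pixel1\n1,0,255\n7,128,64\n")

# Skip the header row and keep only the first and third columns.
data = np.loadtxt(filename, delimiter=",", skiprows=1, usecols=[0, 2])

print(data)
```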
You can also import different datatypes into NumPy arrays: for example, setting the
argument dtype to str will ensure that all entries are imported as strings.
This matters when the data are mixed,
for example a table that contains both strings and floats.
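A small sketch of importing mixed data as strings, assuming a hypothetical tab-delimited file with a text column and a numeric column:

```python
import os
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical tab-delimited file mixing strings and floats.
filename = os.path.join(tempfile.mkdtemp(), "mixed_sample.txt")
Path(filename).write_text("name\tfare\nAlice\t71.28\nBob\t7.25\n")

# dtype=str imports every entry, including the header row, as a string.
data = np.loadtxt(filename, delimiter="\t", dtype=str)

print(data)
```

The numeric column is stored as text here, so it would need converting before any arithmetic.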
Importing flat files using pandas
The need for a data structure suited to tables whose columns hold
different types of data prompted Wes McKinney to develop
the pandas library for Python.
Nothing speaks to the project of
pandas more than the
documentation itself:
As Hadley Wickham tweeted,
"A matrix has rows and
columns. A data frame has
observations and variables."
For all of these reasons, it is now
standard and best practice in data
science to use pandas to import flat
files as DataFrames.
To use pandas, you first need to import it.
Then, if we wish to import a CSV, in the most basic case all we need to do is call the
function read_csv()
and supply it with a single argument, the name of the file. Having assigned the
DataFrame to the variable data, we can check the first 5 rows of the DataFrame,
including the header, with the command data.head().
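A minimal sketch of this basic case. The file passengers_sample.csv and its rows are invented for illustration; they only imitate the kind of columns found in the Titanic data.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical CSV file with a header row and mixed column types.
filename = os.path.join(tempfile.mkdtemp(), "passengers_sample.csv")
Path(filename).write_text(
    "name,sex,cabin,survived\n"
    "Alice,female,C85,1\n"
    "Bob,male,E46,0\n"
    "Carol,female,C123,1\n"
    "Dan,male,B42,0\n"
    "Eve,female,D33,1\n"
    "Frank,male,A6,0\n"
)

# read_csv uses the first row as the header by default.
data = pd.read_csv(filename)

# head() returns the first 5 rows of the DataFrame.
print(data.head())
```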