FDS Unit - III
SYLLABUS
Data Loading, Storage, and File Formats : Reading and Writing Data in
Text Format, Reading Text Files in Pieces, Writing Data Out to Text
Format, Manually Working with Delimited Formats, JSON Data, XML and
HTML: Web Scraping, Binary Data Formats, Using HDF5 Format,
Reading Microsoft Excel Files, Interacting with Databases, Storing and
Loading Data in MongoDB
Reading and Writing Data in Text Format
Python has become a beloved language for text and file processing because
of its simple syntax for interacting with files and its convenient data
structures such as lists, tuples, Series, DataFrames, and packing and
unpacking.
Here we take the support of CSV, text, Excel, and JSON files to create
data sets directly, instead of building them from data structures by hand.
CSV FILE: A CSV file is a delimited text file that uses commas to separate
values.
Each line of the file is a data record, and each record contains one or
more fields separated by commas.
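As a minimal sketch, the output below can be produced by reading the file
with read_csv(); the file name s1.csv and its path are assumed from the
later examples in these notes:
import pandas as pd
# read_csv parses a comma-separated file directly into a DataFrame
data = pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv')
print(data)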
output:
CSE IT CSD
0 501 1201 4401
1 502 1202 4402
2 503 1203 4403
Note that the CSV format stores plain delimited text only; it does not
allow any beautification (styling or formatting) of the file contents.
Now we will work with read_table.
import pandas as pd
data = pd.read_table(r'C:\Users\SREE\Desktop\fds prog\s1.csv')
print(data)
CSE,IT,CSD
0 501,1201,4401
1 502,1202,4402
2 503,1203,4403
read_table has an attribute called sep which specifies the delimiter;
passing sep = ',' makes it split each line on the commas.
import pandas as pd
data = pd.read_table(r'C:\Users\SREE\Desktop\fds prog\s1.csv',sep = ',')
print(data)
Output
CSE IT CSD
0 501 1201 4401
1 502 1202 4402
2 503 1203 4403
We also have an attribute called header which tells pandas which row holds
the column names. Assigning header = None treats the first row as data and
gives the columns default integer names.
import pandas as pd
data = pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv', header=None)
print(data)
OUTPUT
0 1 2
0 CSE IT CSD
1 501 1201 4401
2 502 1202 4402
3 503 1203 4403
We have a special attribute called names to set the column names ourselves.
We also have another called skiprows to skip rows; we assign it a value to
skip that many rows from the top of the file.
import pandas as pd
data = pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv',
                   names=['CSE', 'IT', 'DS'], skiprows=1)
print(data)
OUTPUT
CSE IT DS
0 501 1201 4401
1 502 1202 4402
2 503 1203 4403
Major advantages of reading and writing different file formats
(CSV, JSON, text, Excel, ...):
User-friendly interface (working directly with raw data).
Easy retrieval.
Permanent storage.
Easy appending.
Reading text files in pieces:
While processing a large file, or while figuring out the right set of
arguments, we may want to read only a small piece of the file, or iterate
over it in chunks.
We have the attributes nrows and chunksize to break a large file into
smaller pieces for easier processing, as the sketch below shows.
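A minimal sketch of both attributes, assuming the same s1.csv path used in
the earlier examples:
import pandas as pd
# nrows reads only the first n rows of the file
small = pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv', nrows=2)
print(small)
# chunksize returns an iterator that yields the file in pieces of that size
for chunk in pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv', chunksize=2):
    print(chunk)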
Writing data out to text format:
To write data out, we first build a DataFrame and then export it to a CSV
file in the current working directory using the to_csv() method.
import pandas as pd
# build a DataFrame and write it out as a CSV file
data = pd.DataFrame({'a': [2020, 2021, 2022, 2023, 2024],
                     'b': ['JAN', 'FEB', 'MAR', 'APR', 'MAY'],
                     'c': ['MON', 'TUE', 'WED', 'THU', 'FRI']})
data.to_csv(r'C:\Users\SREE\Desktop\fds prog\write.csv')
import pandas as pd
import sys
data = pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv')
print(data)
data.to_csv(r'C:\Users\SREE\Desktop\fds prog\w4.csv')
# writing to sys.stdout prints the CSV text; to_csv() then returns None
k = data.to_csv(sys.stdout)
print(k)
output:
a b c
0 1 11 111
1 2 22 222
2 3 33 333
,a,b,c
0,1,11,111
1,2,22,222
2,3,33,333
None
k1 = data.to_csv(sys.stdout,sep = '|')
print(k1)
output:
|a|b|c
0|1|11|111
1|2|22|222
2|3|33|333
None
Modification of the table:
Now let us modify the existing table by adding another column named 'd'
and filling it with data that includes a missing value (NaN).
import pandas as pd
data=pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv')
print(data)
output:
a b c d
0 1 11 111 NaN
1 2 22 222 2222.0
2 3 33 333 NaN
Now we can replace NaN in the written output with NULL, using the na_rep
attribute available in the to_csv() method.
import sys
data.to_csv(sys.stdout, na_rep="NULL")
output:
,a,b,c,d
0,1,11,111,NULL
1,2,22,222,2222.0
2,3,33,333,NULL
We can also replace the missing values with any value we require.
data.to_csv(sys.stdout, na_rep='77')
output:
,a,b,c,d
0,1,11,111,77
1,2,22,222,2222.0
2,3,33,333,77
We can also remove the index and the header from the output by using the
index and header attributes of the to_csv() method; each is removed by
assigning the attribute the value False.
data.to_csv(sys.stdout, sep='|', na_rep=77, index=False, header=False)
output:
1|11|111|77
2|22|222|2222.0
3|33|333|77
Manually working with delimited formats:
We can also read (or) write files manually, without using the built-in
methods read_csv(), read_table(), and to_csv().
In this case we use the csv module to work manually with delimited files.
To work with the csv module we first need to import it. Now let us see how
to work manually with delimited files.
import csv
f = open(r'C:\Users\SREE\Desktop\fds prog\s1.csv')
out = csv.reader(f)
# csv.reader yields each line of the file as a list of strings
for i in out:
    print(i)
f.close()
OUTPUT
['CSE', 'IT', 'CSD', 'AIDS']
['501', '1201', '4401', '5401']
['502', '1202', '4402', 'NaN']
['503', '1203', '4403', '5403']
line = list(csv.reader(open(r'C:\Users\SREE\Desktop\fds prog\s1.csv')))
print(line)
Output
[['CSE', 'IT', 'CSD', 'AIDS'], ['501', '1201', '4401', '5401'], ['502', '1202',
'4402', 'NaN'], ['503', '1203', '4403', '5403']]
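The csv module can also write delimited files manually through csv.writer.
A minimal sketch, using a hypothetical output file out.csv:
import csv
rows = [['CSE', 'IT', 'CSD'], ['501', '1201', '4401'], ['502', '1202', '4402']]
# newline='' prevents blank lines between rows on Windows
with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)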
JSON [JavaScript Object Notation]
JSON (JavaScript Object Notation) is a lightweight data-interchange
format.
It is easy for humans to read and write.
JSON requires less memory to store the data.
It is based on a subset of the JavaScript Programming Language Standard.
It is the most commonly used message format for applications, irrespective
of language, because of its interoperability.
JSON is very similar to a Python dict object: it contains a group of key-
value pairs.
It has become one of the standard formats for sending data by HTTP
request between web browsers and other applications.
JSON Module:
Python has a built-in package called json, which can be used to
work with JSON data.
The json module converts JSON text to a Python dict object and a
Python dict object to JSON text.
Serialization:
The process of converting an object from a Python-supported form to
either a file-supported (or) network-supported form.
It is also defined as the process of encoding native data types into
JSON format.
For serialization we use the dump() and dumps() methods, which are
available in the json module.
dump(): The dump() method is used when the Python object has to be
stored in a file.
dumps(): The dumps() method is used when the object is required as a
string, e.g. for parsing or printing.
import json
sdetail = { 'sname':'kumar',
'age':21,
'per':94.67,
'ismarried':False,
'have':None}
injson = json.dumps(sdetail)
print(injson)
Output:
{"sname": "kumar", "age": 21, "per": 94.67, "ismarried": false,
"have": null}
import json
sdetail = { 'sname':'kumar',
'age':21,
'per':94.67,
'ismarried':False,
'have':None}
injson = json.dumps(sdetail,indent = 4)
print(injson)
OUTPUT
{
"sname": "kumar",
"age": 21,
"per": 94.67,
"ismarried": false,
"have": null
}
To create a JSON file from a Python dict object we use the dump() method,
which is available in the json module.
import json
sdetail = { 'sname':'kumar',
'age':21,
'per':94.67,
'ismarried':False,
'have':None}
# dump() writes the dict directly into the file as JSON
with open('studentinfo.json','w') as f:
    json.dump(sdetail, f, indent=4)
Deserialization:
The process of converting an object from either a file-supported form (or)
network-supported form to a Python-supported form.
It is also defined as the process of decoding data that is in JSON
format into native data types.
For deserialization we use the loads() and load() methods, which are
available in the json module.
loads(): The loads() method is used to convert a JSON string to a Python
dict object.
load(): The load() method is used when we need to read JSON from a file
and convert it into a Python dict object, as sketched after the example
below.
import json
jobj='''{"sname":"kumar", "age":20, "per":80.2, "ismarried":false,
"havegf":null}'''
inpyth=json.loads(jobj)
print(inpyth)
print('Student name::', inpyth['sname'])
print('Student age::', inpyth['age'])
print('Student percentage::', inpyth['per'])
print('Is married::', inpyth['ismarried'])
print('have Girl Friend::', inpyth['havegf'])
OUTPUT
{'sname': 'kumar', 'age': 20, 'per': 80.2, 'ismarried': False, 'havegf': None}
Student name:: kumar
Student age:: 20
Student percentage:: 80.2
Is married:: False
have Girl Friend:: None
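A minimal sketch of load(), reading back the studentinfo.json file created
by the dump() example above:
import json
# load() parses JSON directly from an open file object
with open('studentinfo.json') as f:
    pyobj = json.load(f)
print(pyobj['sname'])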
Web Scraping: Web scraping is the process of gathering information, often
in huge amounts, from the Internet.
XML Parser: An XML parser is a software library or a package that provides
an interface for client applications to work with XML documents. It checks
that the XML document is properly formed and may also validate the XML
document.
There are two types of XML parser:
DOM Parser − Parses an XML document by loading the complete
contents of the document and creating its complete hierarchical tree in
memory.
SAX Parser − Parses an XML document on event-based triggers.
Does not load the complete document into memory.
DOM: A DOM document is an object which contains all the information
of an XML document. It is composed like a tree structure. The DOM Parser
implements a DOM API. This API is very simple to use.
Features of DOM Parser:
A DOM Parser creates an internal structure in memory which is a DOM
document object and the client applications get information of the
original XML document by invoking methods on this document object.
DOM Parser has a tree based structure.
It supports both read and write operations and the API is very simple to
use.
It is preferred when random access to widely separated parts of a
document is required (see the sketch below).
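A minimal DOM sketch using Python's built-in xml.dom.minidom module on a
small hypothetical document:
from xml.dom.minidom import parseString
# the whole document is loaded into an in-memory tree
doc = parseString('<note><to>Tove</to><from>Jani</from></note>')
# random access: pick any element directly out of the tree
for node in doc.getElementsByTagName('to'):
    print(node.firstChild.data)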
SAX: A SAX Parser implements the SAX API. This API is event based and
less intuitive.
Features of SAX Parser:
It does not create any internal structure.
Clients do not call parsing methods directly; they override the callback
methods of the API and place their own code inside those methods.
It is an event-based parser; it works like an event handler in Java.
It is simple and memory efficient.
It is very fast and works for huge documents (see the sketch below).
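A minimal SAX sketch using Python's built-in xml.sax module; the parser
calls the handler methods below as events fire while it streams a small
hypothetical document:
import xml.sax
class NoteHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        print('start tag:', name)
    def characters(self, content):
        if content.strip():
            print('text:', content.strip())
# no tree is built; the document is processed as a stream of events
xml.sax.parseString(b'<note><to>Tove</to></note>', NoteHandler())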
BeautifulSoup library:
It is a Python library used to perform web scraping, pulling data out of
HTML and XML files.
It works with your favourite parser to provide idiomatic ways of
navigating, searching, and modifying the parse tree.
It commonly saves programmers hours or days of work.
requests module:
The requests module allows you to send HTTP requests using Python.
It abstracts the complexities of making requests behind a beautiful,
simple API so that you can focus on interacting with services and
consuming data in your application.
Now let us see how we perform HTML web scraping.
from bs4 import BeautifulSoup
import requests
url = "https://stmaryswomens.com/"
page = requests.get(url)
# page.content holds the raw HTML returned by the server
k = BeautifulSoup(page.content, 'html.parser')
print(k)
output:
It displays all the tags of the above mentioned HTML web page.
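Printing the whole soup is rarely the goal; here is a minimal sketch of
pulling specific elements out of the parsed page (the tag names are
generic HTML, and the exact results depend on the live page):
from bs4 import BeautifulSoup
import requests
page = requests.get("https://stmaryswomens.com/")
soup = BeautifulSoup(page.content, 'html.parser')
# extract the page title and the first few hyperlink targets
if soup.title:
    print(soup.title.text)
for a in soup.find_all('a', href=True)[:5]:
    print(a['href'])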
Let us also see how to work with XML web scraping.
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/xml/note.xml"
resp = requests.get(url)
# parse the XML response; 'html.parser' also tolerates simple XML
k = BeautifulSoup(resp.content, 'html.parser')
t = k.find('note')
print(t.text)
OUTPUT
Tove
Jani
Reminder
Don't forget me this weekend!