
FDS UNIT - III

SYLLABUS

 Data Loading, Storage, and File Formats : Reading and Writing Data in
Text Format, Reading Text Files in Pieces, Writing Data Out to Text
Format, Manually Working with Delimited Formats, JSON Data, XML and
HTML: Web Scraping, Binary Data Formats, Using HDF5 Format,
Reading Microsoft Excel Files, Interacting with Databases, Storing and
Loading Data in MongoDB
Reading and Writing Data in Text Format
 Python has become a beloved language for text and file work because of its simple syntax for interacting with files and its convenient data structures and features like lists, tuples, Series, DataFrames, and packing/unpacking.

 Here we take the support of CSV, TEXT, EXCEL, and JSON files to create data sets directly, instead of creating the data sets using in-memory data structures.

 CSV FILE: A CSV file is a delimited text file that uses commas to separate the values.
 Each line of the file is a data record, and each data record contains one or more fields separated by commas.

 CSV stores data in tabular format as plain text, so various software programs and applications can import from or export to CSV files.

 To import a CSV file directly into the Python environment, we use a method called read_csv(). Similarly, we can also use a method called read_table().
 First we create a file in CSV format (for example, in MS Excel saved as CSV) to serve as the data set.
 import pandas as pd
 data = pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv')
 print(data)

 output:
CSE IT CSD
 0 501 1201 4401
 1 502 1202 4402
 2 503 1203 4403
 The CSV file format never preserves any beautification (fonts, colours, cell formatting) applied to the current working file.
 Now we will work with read_table.
 import pandas as pd
 data = pd.read_table(r'C:\Users\SREE\Desktop\fds prog\s1.csv')
 print(data)

CSE,IT,CSD
 0 501,1201,4401
 1 502,1202,4402
 2 503,1203,4403
 We have a parameter called sep which specifies the delimiter; passing sep = ',' to read_table makes it split each line on the commas instead of leaving them in the output.
 import pandas as pd
 data = pd.read_table(r'C:\Users\SREE\Desktop\fds prog\s1.csv',sep = ',')
 print(data)

 Output

CSE IT CSD
 0 501 1201 4401
 1 502 1202 4402
 2 503 1203 4403
 We also have a parameter called header which tells pandas which row holds the column names. Here we assign header = None so that pandas uses default integer column labels and treats the first row as data.
 import pandas as pd
 data = pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv',header =
None)
 print(data)

 OUTPUT
0 1 2
 0 CSE IT CSD
 1 501 1201 4401
 2 502 1202 4402
 3 503 1203 4403
 We have a special parameter to change the column names, called names. We also have another, called skiprows, to skip rows; we assign it a value to skip the specified number of rows from the top of the file.
 import pandas as pd
 data = pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv',names = ['CSE','IT','DS'],skiprows = 1)
 print(data)
 OUTPUT

CSE IT DS
 0 501 1201 4401
 1 502 1202 4402
 2 503 1203 4403
 Major advantages of reading and writing different file formats (CSV, JSON, TEXT, EXCEL, …):
 User-friendly interface (working directly with raw data).
 Easy retrievals.
 Permanently stored.
 Easy appending.
 Reading text files in pieces:
 While processing a large file, or while figuring out the right set of arguments (or) rows, we may want to read only a small piece of the file, or iterate through it in chunks.
 We have parameters like nrows and chunksize to break a large file into smaller chunks for easy processing of the data; a chunksize sketch follows the nrows example below.

 Now let us see how the nrows will work


 import pandas as pd
 data = pd.read_table(r'C:\Users\SREE\Desktop\fds prog\s1.csv',nrows = 2)
 print(data)
 OUTPUT
 CSE,IT,CSD
 0 501,1201,4401
 1 502,1202,4402
 Writing data out to text format:
 Data can also be exported to a delimited format. This is done using the to_csv() method.

 Here we need an empty CSV file in the current working directory; let us consider the empty CSV file as Book2.csv (to_csv() will create the file if it does not already exist).
 import pandas as pd
 data = pd.DataFrame({'a':[2020,2021,2022,2023,2024],'b':
['JAN','FEB','MAR','APR','MAY'],'c':['MON','TUE','WED','THU','FRI']})
 data.to_csv(r'C:\Users\SREE\Desktop\fds prog\write.csv')
 import pandas as pd
 import sys
 data=pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv')
 print(data)
 data.to_csv(r'C:\Users\SREE\Desktop\fds prog\w4.csv')
 k = data.to_csv(sys.stdout)
 print(k)
 output:
a b c
 0 1 11 111
 1 2 22 222
 2 3 33 333
,a,b,c
 0,1,11,111
 1,2,22,222
 2,3,33,333
 None
 k1 = data.to_csv(sys.stdout,sep = '|')
 print(k1)
 output:
 |a|b|c
 0|1|11|111
 1|2|22|222
 2|3|33|333
 None
 Modification of the table:
 Now let us modify the existing table by opening Book3.csv and adding another column named 'd', filling it with data along with NaN values.
 import pandas as pd
 data=pd.read_csv(r'C:\Users\SREE\Desktop\fds prog\s1.csv')
 print(data)
 output:
a b c d
 0 1 11 111 NaN
 1 2 22 222 2222.0
 2 3 33 333 NaN
 Now we can manipulate the exported output by replacing NaN with NULL, which is done using the na_rep parameter available in the to_csv() method.
 data.to_csv(sys.stdout,na_rep = "NULL")
 output:
 ,a,b,c,d
 0,1,11,111,NULL
 1,2,22,222,2222.0
 2,3,33,333,NULL
 We can also replace the NaN values with any other required value.
 data.to_csv(sys.stdout,na_rep = "77")
 output:
 ,a,b,c,d
 0,1,11,111,77
 1,2,22,222,2222.0
 2,3,33,333,77
 We can also alter the output by removing the index and header using the index and header parameters, which are available in the to_csv() method.
 The index and header are removed simply by assigning each attribute the value False.
 data.to_csv(sys.stdout, sep='|', na_rep=77, index=False, header=False)
 output:
 1|11|111|77
 2|22|222|2222.0
 3|33|333|3333.0
 Manually working with delimited formats:
 We can also read (or) write files manually, without using built-in methods like read_csv(), read_table(), or to_csv().
 In this case we use the csv module to work manually with delimited files.
 To work with the csv module we first need to import it. Now let us see how to work manually with delimited files; a writing sketch follows the reading example below.
 import csv
 f = open(r'C:\Users\SREE\Desktop\fds prog\s1.csv')
 out = csv.reader(f)
 for i in out:
     print(i)
 OUTPUT
['CSE', 'IT', 'CSD', 'AIDS']
['501', '1201', '4401', '5401']
['502', '1202', '4402', 'NaN']
['503', '1203', '4403', '5403']
 line = list(csv.reader(open(r'C:\Users\SREE\Desktop\fds prog\s1.csv')))
 print(line)
 Output
 [['CSE', 'IT', 'CSD', 'AIDS'], ['501', '1201', '4401', '5401'], ['502', '1202',
'4402', 'NaN'], ['503', '1203', '4403', '5403']]
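 Writing works the same way through csv.writer(). A minimal sketch, assuming a hypothetical output file out.csv:
 import csv
 rows = [['CSE', 'IT', 'CSD'], ['501', '1201', '4401'], ['502', '1202', '4402']]
 # newline = '' avoids blank lines between rows on Windows
 with open('out.csv', 'w', newline = '') as f:
     writer = csv.writer(f)
     writer.writerows(rows)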
JSON (JavaScript Object Notation)
 JSON (JavaScript Object Notation) is a lightweight data-interchange
format.
 It is easy for humans to read and write.
 JSON requires less memory to store the data.
 It is based on a subset of the JavaScript Programming Language Standard.
 It is one of the most commonly used message formats for applications, irrespective of language, and supports interoperability.
 JSON is very similar to a Python dict object: it contains a group of key-value pairs.
 It has become one of the standard formats for sending data by HTTP
request between web browsers and other applications.
 JSON Module:
 Python has a built-in package called json, which can be used to work with JSON data.
 The json module converts JSON text to Python dict objects and Python dict objects to JSON text.
 Serialization:
 The process of converting an object from Python-supported form to either file-supported (or) network-supported form.
 It is also defined as the process of encoding native data types into JSON format.
 For serialization purposes we use the dump() and dumps() methods, which are available in the json module.
 dump(): The dump() method is used when the Python objects have to be stored in a file.
 dumps(): The dumps() method is used when the objects are required in string format, for parsing, printing, etc.
 import json
 sdetail = { 'sname':'kumar',
 'age':21,
 'per':94.67,
 'ismarried':False,
 'have':None}
 injson = json.dumps(sdetail)
 print(injson)
 Output:
 {"sname": "kumar", "age": 21, "per": 94.67, "ismarried": false,
"have": null}
 import json
 sdetail = { 'sname':'kumar',
 'age':21,
 'per':94.67,
 'ismarried':False,
 'have':None}
 injson = json.dumps(sdetail,indent = 4)
 print(injson)
 OUTPUT
 {
 "sname": "kumar",
 "age": 21,
 "per": 94.67,
 "ismarried": false,
 "have": null
 }
 To create a JSON file from a Python dict object we use the dump() method, which is available in the json module.
 import json
 sdetail = { 'sname':'kumar',
 'age':21,
 'per':94.67,
 'ismarried':False,
 'have':None}
 with open('studentinfo.json','w') as f:
     json.dump(sdetail,f,indent = 4)
 Deserialization:
 The process of converting an object from either file-supported form (or) network-supported form to Python-supported form.
 It is also defined as the process of decoding data that is in JSON format into native data types.
 For deserialization purpose we use loads() and load() which are
available in json module.
 loads(): The loads() method is used to convert a JSON string to a Python dict object.
 load(): The load() method is used when we need to read JSON from a file and convert it into a Python dict object.
 import json
 jobj='''{"sname":"kumar", "age":20, "per":80.2, "ismarried":false,
"havegf":null}'''
 inpyth=json.loads(jobj)
 print(inpyth)
 print('Student name::', inpyth['sname'])
 print('Student age::', inpyth['age'])
 print('Student percentage::', inpyth['per'])
 print('Is married::', inpyth['ismarried'])
 print('have Girl Friend::', inpyth['havegf'])
 OUTPUT
 {'sname': 'kumar', 'age': 20, 'per': 80.2, 'ismarried': False, 'havegf': None}
 Student name:: kumar
 Student age:: 20
 Student percentage:: 80.2
 Is married:: False
 have Girl Friend:: None
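 Similarly, load() reads JSON directly from a file object. A minimal sketch, assuming the studentinfo.json file created with dump() earlier:
 import json
 with open('studentinfo.json') as f:
     inpyth = json.load(f)
 # the JSON types are converted back to native Python types
 print(inpyth['sname'], inpyth['age'])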
 Web Scraping: Web scraping is the process of gathering information from the Internet, often in very large volumes.
 XML Parser: An XML parser is a software library or a package that provides an interface for client applications to work with XML documents. It checks that the XML document is properly formed and may also validate the XML documents.
 There are two types of XML parser-
 DOM Parser − Parses an XML document by loading the complete contents of the document and creating its complete hierarchical tree in memory.
 SAX Parser − Parses an XML document on event-based triggers.
Does not load the complete document into the memory.
 DOM: A DOM document is an object which contains all the information
of an XML document. It is composed like a tree structure. The DOM Parser
implements a DOM API. This API is very simple to use.
 Features of DOM Parser:
 A DOM Parser creates an internal structure in memory which is a DOM
document object and the client applications get information of the
original XML document by invoking methods on this document object.
 DOM Parser has a tree based structure.
 It supports both read and write operations and the API is very simple to
use.
 It is preferred when random access to widely separated parts of a
document is required.
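 A minimal DOM sketch using Python's built-in xml.dom.minidom module (the XML string here is illustrative):
 from xml.dom.minidom import parseString
 doc = parseString('<note><to>Tove</to><from>Jani</from></note>')
 # the whole tree is in memory, so any element can be accessed at random
 print(doc.getElementsByTagName('to')[0].firstChild.data)    # Tove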
 SAX: A SAX Parser implements the SAX API. This API is event based and less intuitive.
 Features of SAX Parser:
 It does not create any internal structure.
 Clients do not need to know which methods to call; they just override the callback methods of the API and place their own code inside them.
 It is an event-based parser; it works like an event handler in Java.
 It is simple and memory efficient.
 It is very fast and works for huge documents.
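 A minimal SAX sketch using Python's built-in xml.sax module; note that the client only overrides callback methods:
 import xml.sax

 class NoteHandler(xml.sax.ContentHandler):
     def startElement(self, name, attrs):
         # fired once per opening tag as the parser streams the document
         print('start:', name)
     def characters(self, content):
         if content.strip():
             print('text:', content.strip())

 xml.sax.parseString(b'<note><to>Tove</to></note>', NoteHandler())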
 BeautifulSoup library:
 It is a Python library used to perform web scraping, pulling data out of HTML and XML files.
 It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
 It commonly saves programmers hours or days of work.
 requests module:
 The requests module allows you to send HTTP requests using Python.
 It abstracts the complexities of making requests behind a beautiful, simple API so that you can focus on interacting with services and consuming data in your application.
 Now let us see how we perform HTML web scraping
 from bs4 import BeautifulSoup
 import requests
 url="https://stmaryswomens.com/"
 page = requests.get(url)
 k = BeautifulSoup(page.content, 'html.parser')
 print(k)
 output:
 It displays all the tags of the above mentioned HTML web page.
 Let us also see how to work with XML web scraping.
 from bs4 import BeautifulSoup
 import requests
 url="https://www.w3schools.com/xml/note.xml"
 xml = requests.get(url)
 k = BeautifulSoup(xml.content, 'html.parser')
 t=k.find('note')
 print(t.text)
 OUTPUT
 Tove

 Jani

 Reminder

 Don't forget me this weekend!


 Binary data formats:
 We have many ways to store data in binary format, but one of the easiest is using Python's built-in pickle serialization.
 Using pickle serialization, the data is stored efficiently in binary format.
 The pickle module provides the conventional functions load(), loads(), dump(), and dumps() to read and write data from and to disk; pandas additionally offers read_pickle() and to_pickle().
 Pickle:
 The pickle module implements binary protocols for serializing and de-
serializing a Python object structure.
 Pickling is the process whereby a Python object hierarchy is converted
into a byte stream.
 Python objects like lists and dicts are converted into a byte stream.
 The idea behind the byte stream is that it contains all the information necessary to reconstruct the object in another Python script.
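 A minimal sketch of pickling and unpickling, assuming a hypothetical file name data.pkl:
 import pickle
 record = {'sname': 'kumar', 'marks': [501, 1201, 4401]}
 # dump() serializes the object to a binary file; mode 'wb' is required
 with open('data.pkl', 'wb') as f:
     pickle.dump(record, f)
 # load() reconstructs the object from the byte stream
 with open('data.pkl', 'rb') as f:
     print(pickle.load(f))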
 HDF5 format:
 HDF5 stands for Hierarchical Data Format 5.
 It is an open-source file format which comes in handy for storing large amounts of data.
 It stores data in a hierarchical structure within a single file, so we can quickly access a particular part of the file rather than reading the whole file.
 One important feature is that it can attach metadata to every piece of data in the file, which provides powerful searching and access.
 There are a number of tools that facilitate efficient reading and writing of large amounts of data in binary format on disk.
 The two interfaces to the HDF5 library in Python are PyTables and h5py.
 Each of these two interfaces takes a different approach to the problem.
 h5py:
 The h5py package is a Pythonic interface to the HDF5 binary data format.
 h5py provides an easy-to-use high-level interface, which allows you to store huge amounts of numerical data.
 You can easily manipulate that data from NumPy: h5py uses straightforward NumPy and Python metaphors, like dictionary and NumPy array syntax.
 h5py provides both a direct and a high-level interface to the HDF5 API.
 PyTables:
 PyTables is a Python package for storing and querying large tabular
datasets in an efficient way.
 PyTables is built on top of the HDF5 library and the NumPy and numexpr
packages.
 These provide the foundations for very compact storage and high
performance data management.
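 A minimal sketch using pandas' HDF5 support, which is built on PyTables (so the tables package must be installed); the file name data.h5 is illustrative:
 import pandas as pd
 df = pd.DataFrame({'CSE': [501, 502, 503], 'IT': [1201, 1202, 1203]})
 # each object is stored under a key, giving the file its hierarchical structure
 df.to_hdf('data.h5', key = 'students', mode = 'w')
 print(pd.read_hdf('data.h5', 'students'))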
 Interacting with databases (MongoDB):
 This unit introduces the basic and advanced concepts of MongoDB.
 It is among the most popular and trending databases nowadays, even though we have a lot of database software like Oracle, MySQL, DB2, Sybase, IBM DB, … from different database vendors like Oracle Corporation, IBM, Microsoft, etc.
 MongoDB is derived from the word "HUMONGOUS", which means extremely large.
 MongoDB is also offered as a cloud-based database.
 MongoDB is not only the database but also the company, i.e. the vendor of MongoDB.
 MongoDB also provides a lot of software programs.
 Reasons for the use of MongoDB:
 We can use MongoDB wherever we are developing an application that requires a database.
 As a document database, MongoDB makes it easy for developers to store structured or unstructured data. It uses a JSON-like format to store documents.
 MongoDB can also handle high volume and can scale both vertically and horizontally to accommodate large data loads.
 MongoDB has become one of the most wanted databases in the world
because it makes it easy for developers to store, manage, and retrieve data
when creating applications with most programming languages.
 We can use the MongoDB in desktop application, client server
application, enterprise application and also in web application.
 Working of MongoDB:
 MongoDB is a database server, and the data is stored in databases on that server.
 The MongoDB environment gives you a server that you can start and then create multiple databases on using MongoDB. Because it is a NoSQL database, the data is stored in collections and documents.
 Now let us see the relation between databases, collections, and documents: MongoDB contains 'N' number of databases, each database contains 'N' number of collections, and each collection consists of 'N' number of documents.
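 A minimal sketch of storing and loading a document with the pymongo driver, assuming a MongoDB server running at the default localhost:27017 and hypothetical database/collection names:
 from pymongo import MongoClient
 client = MongoClient('mongodb://localhost:27017/')
 db = client['college']              # database
 students = db['students']           # collection
 # storing: insert a JSON-like document
 students.insert_one({'sname': 'kumar', 'branch': 'CSE', 'roll': 501})
 # loading: query the documents back as Python dicts
 for doc in students.find({'branch': 'CSE'}):
     print(doc)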
 Data representation in MongoDB:
 The data is represented in JSON format, but internally it is stored in BSON.
 BSON is binary-encoded JavaScript Object Notation.
 JSON is a textual object notation widely used to transmit and store data across web applications; it is easier to understand as it is human-readable, but compared to BSON it supports fewer data types.
 BSON has additional data type support like numberInt, ObjectID, …
 When BSON is converted back into JSON we have a problem with these additionally supported data types.
 To solve this problem the output is given in EJSON (Extended JSON).
 In MongoDB all these conversions are done by the database engine, which is included in MongoDB.
 The biggest advantage of MongoDB is fast access to the data in the documents.
 Key characteristics of MongoDB:
 All the information is available in a single document; no join operation is required, so retrieval is fast.
 Documents are independent of each other and there is no fixed schema; hence we can store unstructured data like videos, audio, images, …
 We can perform operations like editing, deleting, and inserting documents very easily.
 Retrieved data is in the form of JSON, which can be understood by any programming language without any conversion.
 We can store very huge amounts of data, and hence scalability is high.
 Performance and flexibility are the biggest assets of MongoDB.
