Big data and effective
distributed computing
What is Big Data?
3 V Definition: Volume, Velocity, Variety
How much data is generated in 1 minute?
➢ 4,166,667 posts on Facebook
➢ 347,222 tweets on Twitter
➢ 1,736,111 posts on Instagram
➢ 300 hours of video on YouTube
➢ 18,327 votes cast on Reddit
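To put the per-minute figures in perspective, they can be scaled to a full day. The following is a minimal sketch, assuming each rate stays constant for all 1,440 minutes of a day; the dictionary keys and function name are illustrative only.

```python
# Per-minute figures taken from the slide above.
PER_MINUTE = {
    "Facebook posts": 4_166_667,
    "Twitter tweets": 347_222,
    "Instagram posts": 1_736_111,
    "YouTube video hours": 300,
    "Reddit votes": 18_327,
}

MINUTES_PER_DAY = 24 * 60  # 1,440 minutes


def per_day(per_minute: dict) -> dict:
    """Scale each per-minute count to a per-day count (assumes a constant rate)."""
    return {name: count * MINUTES_PER_DAY for name, count in per_minute.items()}


if __name__ == "__main__":
    for name, count in per_day(PER_MINUTE).items():
        print(f"{name}: {count:,} per day")
```

At this rate the Facebook figure alone comes to roughly 6 billion posts per day.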
What are the Types of Data?
Structured Data:
➢ There are 3 aspects defined:
➢ Name
➢ Type
➢ Length
First_name Hire_date Salary
Gopichand 36/02/2013 $6000
Rahul 28/01/2013 10000
➢ Example: Hire_date date
➢ Example: Salary number(6,0)
➢ e.g. Oracle, SQL Server, DB2, Sybase, Teradata, MySQL (all RDBMSs)
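The effect of defining a name, type, and length for every column can be sketched in a few lines. This is a simplified illustration, not how any particular RDBMS implements its checks; the `SCHEMA` dictionary, the length of 20 for `First_name`, and the `validate` helper are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical schema: column name -> (type, maximum length).
# Salary follows the slide's number(6,0); the other lengths are assumed.
SCHEMA = {
    "First_name": ("text", 20),
    "Hire_date": ("date", 10),
    "Salary": ("number", 6),
}


def validate(row: dict) -> list:
    """Return a list of error messages; an empty list means the row is accepted."""
    errors = []
    for column, (kind, max_len) in SCHEMA.items():
        value = row.get(column, "")
        if len(value) > max_len:
            errors.append(f"{column}: longer than {max_len} characters")
        elif kind == "date":
            try:
                datetime.strptime(value, "%d/%m/%Y")  # day/month/year
            except ValueError:
                errors.append(f"{column}: '{value}' is not a valid date")
        elif kind == "number" and not value.isdigit():
            errors.append(f"{column}: '{value}' is not a number")
    return errors


if __name__ == "__main__":
    rows = [
        {"First_name": "Gopichand", "Hire_date": "36/02/2013", "Salary": "$6000"},
        {"First_name": "Rahul", "Hire_date": "28/01/2013", "Salary": "10000"},
    ]
    for row in rows:
        print(row["First_name"], validate(row) or "accepted")
```

Under this sketch the first row is rejected twice (day 36 does not exist, and `$6000` is not a number), while the second row is accepted.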
Semi-Structured Data:
➢Is this valid in Excel?
First_name Hire_date Salary
Gopichand 36/02/2013 $6000
Rahul 28/01/2013 10000
➢Is this valid in XML?
<row>
<name>Gopichand</name>
<hire_date>36/02/2013</hire_date>
<salary>$6000</salary>
</row>
<row>
<name>Rahul</name>
<hire_date>28/01/2013</hire_date>
<salary>10000</salary>
</row>
➢ Example: Excel, XML, JSON, CSV
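One way to check this is to feed the rows to an XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the `<rows>` wrapper element and the `parse_rows` helper are assumptions added for the example, since an XML document needs a single root element.

```python
import xml.etree.ElementTree as ET

# The two rows from the slide, wrapped in a hypothetical <rows> root element.
XML_TEXT = """
<rows>
  <row>
    <name>Gopichand</name>
    <hire_date>36/02/2013</hire_date>
    <salary>$6000</salary>
  </row>
  <row>
    <name>Rahul</name>
    <hire_date>28/01/2013</hire_date>
    <salary>10000</salary>
  </row>
</rows>
"""


def parse_rows(xml_text: str) -> list:
    """Parse each <row> into a dict of tag -> text; values are not type-checked."""
    root = ET.fromstring(xml_text.strip())
    return [{child.tag: child.text for child in row} for row in root.findall("row")]


if __name__ == "__main__":
    for row in parse_rows(XML_TEXT):
        print(row)
```

The parser accepts both rows, including the impossible date and the `$` in the salary, because the tags give the data structure but nothing enforces a type or length on the values.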
Unstructured Data
➢ Free-flowing text (plain text)
➢ Tweets on Twitter
➢ Messages on WhatsApp
➢ Emails on Gmail, Ymail, etc.
➢ Word docs, PDFs, comments on Facebook posts
➢ Videos on YouTube
➢ Log files from web servers, log files from app servers, audit logs, etc.
Can this be Processed by the Legacy Frameworks Available?
➢ ETL tools (Ab Initio, Informatica, DataStage, etc.)
➢ RDBMSs (Oracle, SQL Server, DB2, Sybase, Teradata, MySQL)
➢ Mainframes (IBM mainframe, AS/400, etc.)
The problem is: Client-Server Architecture
How do Legacy Frameworks Work?
1. Client writes a select query:
   Select *
   From emp
   Where salary > 5000;
2. Client sends the query to the server
3. Server processes the data
4. Server sends the result to the client
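The four steps can be sketched with Python's built-in `sqlite3` module. SQLite is an embedded database, so the in-memory connection here only stands in for a real database server; the `run_query` helper, the two-column `emp` table, and the row for "Asha" are assumptions made for the example.

```python
import sqlite3

# Sample rows: the first two follow the slides, "Asha" is a made-up row.
EMP = [("Gopichand", 6000), ("Rahul", 10000), ("Asha", 4000)]


def run_query(rows, min_salary):
    """Load rows into a table, run the select, and return the matching rows."""
    conn = sqlite3.connect(":memory:")  # stand-in for the database server
    try:
        conn.execute("CREATE TABLE emp (first_name TEXT, salary INTEGER)")
        conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)
        # Steps 1-2: the client writes the query and sends it to the server.
        cursor = conn.execute(
            "SELECT * FROM emp WHERE salary > ? ORDER BY first_name",
            (min_salary,),
        )
        # Steps 3-4: the server processes the data and returns the result.
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_query(EMP, 5000))
```

All of the scanning and filtering happens in one place, which is why a single server becomes the bottleneck as the table grows.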
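What is the Scaling Problem?
➢ Vertical Scaling: add more CPU, memory, or storage to a single server
➢ Horizontal Scaling: add more servers and spread the data and work across them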
What Happens to these Databases when Data Increases?
➢ Select queries run very slowly, which increases the time taken to generate reports
➢ Buffer/cache overflow can lead to a server crash
Big Data in Telecom
Data:
➢ Incoming Calls
➢ Outgoing Calls
➢ Data Usage
➢ SMS
➢ VAS (Value-Added Services)
➢ Call Detail Records (CDR)
[Figure: locations labeled Pimple Saudagar, Deccan, Kharadi, Hinjawadi, Nal Stop; label: Incoming Calls]
Think About!
➢ Total Activity
➢ Millions of Users
➢ Huge number of CDR Logs Generated
➢ Million Logs per Tower
➢ So many Towers in a City
➢ Multiple Towers in the Country
Big Data
Big Data in Banking
➢ Cheque
➢ Net Banking
➢ Loans: Home Loans, Car Loans, Gold Loans
➢ Swipes: Credit Card, Debit Cards, POS
➢ Social Media
➢ Spending Patterns
➢ Investment Banking, Retail Banking, Corporate Banking
➢ Mutual Funds, SIP, EMIs
Millions of Users…
Millions of Logs…
Millions of Rows…
Gigabytes and Terabytes of Data
Big Data in E-commerce
➢ Advertising and SEM (Search Engine Marketing)
Millions of Customers…
Millions of Clicks…
Clicks × Cost Per Click (CPC)
1 Click: Rs. 50-500
Crores of Rupees
Click-Based Logs
Millions and Millions of Clickstream Logs…
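The "Clicks × Cost Per Click" arithmetic can be worked through with a short sketch. The click count of one million is a hypothetical figure chosen for illustration; only the Rs. 50-500 range comes from the slide, and the `ad_spend` helper is an assumed name.

```python
CRORE = 10_000_000  # 1 crore = 10 million


def ad_spend(clicks: int, cost_per_click: int) -> int:
    """Total advertising spend in rupees = clicks x cost per click (CPC)."""
    return clicks * cost_per_click


if __name__ == "__main__":
    clicks = 1_000_000  # hypothetical: one million clicks
    for cpc in (50, 500):  # Rs. per click, the range quoted on the slide
        total = ad_spend(clicks, cpc)
        print(f"{clicks:,} clicks at Rs. {cpc} = Rs. {total:,} ({total / CRORE:g} crore)")
```

Under that assumption, one million clicks cost between Rs. 5 crore and Rs. 50 crore, and each click also leaves behind a clickstream log entry to store and analyze.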
Big Data in Healthcare
➢ Operator Information
➢ Doctor/Nurse Information
➢ Medical Records
➢ Drug Information
➢ Emergency Work Log Reports
➢ In-Patient Record
➢ Out-Patient Record
[Figure labels: Billing, Medicines, Insurance Claims, Test Reports, Room Booking, Old, New, Hospital Staff Reports]
Every Hour…
Every Day…
Every Branch…
Every Hospital…
Gigabytes and Terabytes of Data
Android Apps
➢ When you install an Android app, you authorize it to read your messages, mails, and phonebook, and to access your media, etc.
Temple Run and Advertisement
Advertisements shown here
How do we solve this problem?
Distributed Processing
[Figure: select query / code written at one place; three host computers, each connected to four terminals]
Unlimited Scaling!!
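The idea of writing the code in one place and running it on many machines can be sketched on a single machine. In the example below, threads stand in for host computers and list slices stand in for the data stored on each host. This is an illustration of the principle only, not how Hadoop is implemented; the function names and the sample rows for "Asha" and "Meera" are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Sample rows: "Asha" and "Meera" are made-up rows for the example.
EMP = [("Gopichand", 6000), ("Rahul", 10000), ("Asha", 4000), ("Meera", 7500)]


def filter_partition(partition, min_salary):
    """Work done on one host: scan only its own slice of the data."""
    return [row for row in partition if row[1] > min_salary]


def distributed_select(rows, min_salary, hosts=3):
    """Split rows across `hosts` workers, filter in parallel, merge the results."""
    # Round-robin split: each simulated host gets roughly 1/hosts of the rows.
    partitions = [rows[i::hosts] for i in range(hosts)]
    with ThreadPoolExecutor(max_workers=hosts) as pool:
        parts = list(pool.map(lambda p: filter_partition(p, min_salary), partitions))
    # Merge step: combine the partial results from every host.
    return sorted(row for part in parts for row in part)


if __name__ == "__main__":
    print(distributed_select(EMP, 5000))
```

The same `filter_partition` code runs against every slice, so handling more data means adding more hosts rather than buying a bigger server.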
Hadoop
History of Hadoop
2003: Google published a white paper on GFS (Google File System), its file system for storing data.
2004: Google published a white paper on MapReduce, its software framework for processing data.
Hadoop was developed as an open-source project based on the ideas in these white papers, with Yahoo as a major early contributor.
Who is the inventor of Hadoop?
Doug Cutting is the inventor of Hadoop.
Doug Cutting's young son had a toy elephant named Hadoop, and the technology was named "Hadoop" after it.