Lab Manual BDA


Practical List

Subject Name: Big Data and Analytics


Faculty Name: Prof. Krunal Pimple Year: 2018-19
Class: B.E

Sr. No Name of Experiment

1 To study Hadoop Ecosystem

2 To study the use of MapReduce.

3 Case study on the NoSQL system.

4 To study the Mining of the Data Streams.

5 Case study on frequent item set.

6 Case study on Recommendation System.

7 Case Study on the Big Data Solutions.

8 Case study on the Mining Network as graph.

Subject In charge

Prof. Krunal Pimple


Experiment no.1

Aim: Case study on Hadoop Ecosystem

Fig:1.1
Big Data has been the buzzword circulating in the IT industry since 2008. The amount of data being
generated by social networks, manufacturing, retail, stock markets, telecom, insurance, banking, and
health care is far beyond what we could once have imagined.
Before the advent of Hadoop, storage and processing of big data was a major challenge. Now that
Hadoop is available, companies have realized the business impact of Big Data and how
understanding this data will drive growth. For example:
• The banking sector has a better chance of understanding its loyal customers, loan defaulters and
fraudulent transactions.
• The retail sector now has enough data to forecast demand.
• The manufacturing sector need not depend on costly mechanisms for quality testing; capturing
sensor data and analyzing it reveals many patterns.

• E-commerce sites and social networks can personalize pages based on customer interests.
• Stock markets generate a humongous amount of data, and correlating it over time reveals
valuable insights.
Big Data therefore has many useful and insightful applications.
Hadoop is the straightforward answer for processing Big Data. The Hadoop ecosystem is a
combination of technologies that together solve these business problems.
Let us understand the components of the Hadoop Ecosystem in order to build the right solutions.
Core Hadoop:
HDFS:
HDFS stands for Hadoop Distributed File System; it manages big data sets characterized by high
volume, velocity and variety. HDFS implements a master-slave architecture in which the master is
the NameNode and the slaves are the DataNodes.
Features:
• Scalable
• Reliable
• Commodity Hardware
HDFS is the component best known for Big Data storage.
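To make this concrete, here is a minimal sketch using the Hadoop FileSystem Java API to copy a local file into HDFS and read it back. The NameNode address and the file paths are placeholders chosen for illustration, not values prescribed by this manual.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode (the master); the address is a placeholder.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; the NameNode records the metadata,
        // while the file's blocks are stored on the DataNodes (the slaves).
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));

        // Read the file back from HDFS and print its first line.
        try (FSDataInputStream in = fs.open(new Path("/data/sales.csv"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}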
MapReduce:
MapReduce is a programming model designed to process high-volume distributed data. The
platform is built using Java for better exception handling. MapReduce (MRv1) includes two
daemons, the JobTracker and the TaskTracker.
Features:
• Functional Programming.
• Works very well on Big Data.
• Can process large datasets.
MapReduce is the main component for processing big data.
Data Access:
Pig:
Apache Pig is a high-level language built on top of MapReduce for analyzing large datasets with
simple ad-hoc data analysis programs. Pig is also known as a data flow language. It integrates well
with Python and was initially developed by Yahoo.
Salient features of pig:
• Ease of programming
• Optimization opportunities
• Extensibility.
Pig scripts are internally converted into MapReduce programs.
Hive:
Apache Hive is another high-level query language and data warehouse infrastructure built on top
of Hadoop, providing data summarization, query and analysis. It was initially developed by
Facebook and later made open source.
Salient features of hive:
• SQL like query language called HQL.
• Partitioning and bucketing for faster data processing.
• Integration with visualization tools like Tableau.
Hive queries are internally converted into MapReduce programs.
If you want to become a big data analyst, these two high level languages are a must know!!
Data Storage:
Hbase:
Apache HBase is a NoSQL database built for hosting large tables with billions of rows and
millions of columns on top of Hadoop and commodity hardware. Use Apache HBase when you
need random, real-time read/write access to your Big Data (a short Java client sketch follows the
feature list below).
Features:
• Strictly consistent reads and writes. In memory operations.
• Easy to use Java API for client access.
• Well integrated with pig, hive and sqoop.
• It is a consistent and partition-tolerant (CP) system in terms of the CAP theorem.
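As a rough illustration of the Java client API listed above, the sketch below writes one row and reads it back. The table name, column family and row key are made up for the example, the table is assumed to already exist, and connection settings are taken from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table users = connection.getTable(TableName.valueOf("users"))) {

            // Random, real-time write: one row keyed by a user id.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Mumbai"));
            users.put(put);

            // Random, real-time read of the same row.
            Result row = users.get(new Get(Bytes.toBytes("user42")));
            String city = Bytes.toString(row.getValue(Bytes.toBytes("info"), Bytes.toBytes("city")));
            System.out.println("city = " + city);
        }
    }
}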
Cassandra:
Cassandra is a NoSQL database designed for linear scalability and high availability. Cassandra is
based on a key-value (wide-column) model. It was originally developed at Facebook and is known
for fast responses to queries.
Features:
• Column indexes
• Support for de-normalization
• Materialized views
• Powerful built-in caching.

Data Serialization:
Avro:
Apache Avro is a language-neutral data serialization framework. It is designed for language
portability, allowing data to potentially outlive the language used to read and write it.
Data Integration:
Apache Sqoop:
Apache Sqoop is a tool designed for bulk data transfers between relational databases and
Hadoop.
Features:
• Import and export to and from HDFS.
• Import and export to and from Hive.

• Import and export to HBase.


Management, Monitoring and Orchestration:
Apache Zookeeper:
Zookeeper is a centralized service designed for maintaining configuration information, naming,
providing distributed synchronization, and providing group services.
Features:
• Serialization
• Atomicity
• Reliability
• Simple API
Conclusion:

Experiment no.2

Aim: Case Study on MapReduce

Introduction:

MapReduce is a processing technique and a programming model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Second, the reduce task takes the output from a map as
an input and combines those data tuples into a smaller set of tuples. As the sequence of the
name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.

MapReduce Algorithm:
 Generally, the MapReduce paradigm is based on sending the computation to where the data
resides.

 MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.

o Map stage : The map or mapper’s job is to process the input data. Generally the
input data is in the form of file or directory and is stored in the Hadoop file
system (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.

o Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The
Reducer's job is to process the data that comes from the mapper. After processing, it produces a
new set of output, which is stored in HDFS.

 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.

 The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.

 Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.

 After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
Map Reduce Example:

Mapper:
MAPPER (MAintain, Prepare, and Produce Executive Reports), also referred to by diehard MAPPER
programmers as "Most Amazing Programming Product Ever Released", is a database
management and processing system. It is a software tool that enables end-users to share
computer power in a corporation. Users are able to develop their own applications and process
them interactively. The product has a number of unique characteristics that may appear
technically impossible to persons unfamiliar with its method of operation.
MAPPER had its origins outside the technical mainstream of computer programming. Initially
created in Sperry Univac's computer factory in Roseville, Minnesota, MAPPER is a proprietary
product of Unisys Corporation. The principal architect was Louis S. Schlueter, who worked with
other early "gurus" like Michael Stroeing of Sperry, Steve Anderson of Santa Fe Railway, and
later with Kansas City government. MAPPER became a very popular solution in the Government
sector. Large user group communities formed like the New England MAPPER Users Group
headed by Judith Hartman, Howard Roundy and Michael Scolastico.
There are similarities between the development history of MAPPER and that of UNIX. Both
were responses to what appeared to be unreasonable restrictions in the way computer systems
were developed. While UNIX (and later Linux) is a core software "operating" system, MAPPER
is a cross-platform application tool. Thus it will run, like a program, on a variety of operating
systems.

Reducer:

 Reducer reduces a set of intermediate values which share a key to a smaller set of
values.
 The number of reduces for the job is set by the user via
JobConf.setNumReduceTasks(int).
 Overall, Reducer implementations are passed the JobConf for the job via the
JobConfigurable.configure(JobConf) method and can override it to initialize
themselves. The framework then calls reduce(WritableComparable, Iterator,
OutputCollector, Reporter) method for each <key, (list of values)> pair in the grouped
inputs. Applications can then override the Closeable.close() method to perform any
required cleanup.
 Reducer has 3 primary phases: shuffle, sort and reduce.
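To make the mapper and reducer roles concrete, here is a minimal word-count sketch written against the classic org.apache.hadoop.mapred (MRv1) API that the JobConf methods above belong to. The input and output paths come from the command line, and the choice of two reduce tasks is purely illustrative.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // Mapper: split each input line into words and emit (word, 1).
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // Reducer: after shuffle and sort, sum the 1's for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setNumReduceTasks(2);   // set by the user via JobConf.setNumReduceTasks(int)
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}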
Job Tracker:

1. JobTracker process runs on a separate node and not usually on a DataNode.


2. JobTracker is an essential Daemon for MapReduce execution in MRv1. It is replaced by
ResourceManager/ApplicationMaster in MRv2.

3. JobTracker receives the requests for MapReduce execution from the client.

4. JobTracker talks to the NameNode to determine the location of the data.

5. JobTracker finds the best TaskTracker nodes to execute tasks based on the data locality
(proximity of the data) and the available slots to execute a task on a given node.

6. JobTracker monitors the individual TaskTrackers and submits the overall status of the job back
to the client.
7. JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.

8. When the JobTracker is down, HDFS will still be functional, but MapReduce execution cannot
be started and existing MapReduce jobs will be halted.

Conclusion:
Experiment no.3

Aim: Case study on NoSQL

NoSQL

A NoSQL (originally referring to "non-SQL", "non-relational" or "not only SQL") database
provides a mechanism for storage and retrieval of data that is modeled by means other than the
tabular relations used in relational databases. Such databases have existed since the late 1960s,
but did not obtain the "NoSQL" moniker until a surge of popularity in the early twenty-first
century, triggered by the needs of Web 2.0 companies such as Facebook, Google, and
Amazon.com. NoSQL databases are increasingly used in big data and real-time web
applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they
may support SQL-like query languages.

Fig: 3.1

Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of
availability, partition tolerance, and speed. Barriers to the greater adoption of NoSQL stores
include the use of low-level query languages (instead of SQL, for instance the lack of ability to
perform ad-hoc JOINs across tables), lack of standardized interfaces, and huge previous
investments in existing relational databases. Most NoSQL stores lack true ACID transactions,
although a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE, Google Spanner
(though technically a NewSQL database), Symas LMDB, and OrientDB have made them central
to their designs.

Types and examples of NoSQL databases

There have been various approaches to classify NoSQL databases, each with different categories
and subcategories, some of which overlap. What follows is a basic classification by data model,
with examples:
 Column: A column family is a NoSQL object that contains columns of related data. It is
a tuple (pair) that consists of a key-value pair, where the key is mapped to a value that is
a set of columns. In analogy with relational databases, a column family is like a "table",
each key-value pair being a "row". Each column is a tuple (triplet) consisting of a column
name, a value, and a timestamp. In a relational database table, this data would be grouped
together within a table with other non-related data.

Example: Accumulo, Cassandra, Druid, HBase, Vertica, SAP HANA

 Document: A document-oriented database, or document store, is a computer program


designed for storing, retrieving and managing document-oriented information, also
known as semi-structured data. Document-oriented databases are one of the main
categories of NoSQL databases, and the popularity of the term "document-oriented
database" has grown[1] with the use of the term NoSQL itself. XML databases are a
subclass of document-oriented databases that are optimized to work with XML
documents. Graph databases are similar, but add another layer, the relationship, which
allows them to link documents for rapid traversal

Example: Apache CouchDB, Clusterpoint, Couchbase, DocumentDB, HyperDex, IBM


Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB.

 Key-value: A key-value store, or key-value database, is a data storage paradigm designed


for storing, retrieving, and managing associative arrays, a data structure more commonly
known today as a dictionary or hash. Dictionaries contain a collection of objects, or
records, which in turn have many different fields within them, each containing data.
These records are stored and retrieved using a key that uniquely identifies the record, and
is used to quickly find the data within the database.

Example: Aerospike, Couchbase, Dynamo, FairCom c-treeACE, FoundationDB,


HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak,
Berkeley DB.

 Graph: In computing, a graph database is a database that uses graph structures for
semantic queries with nodes, edges and properties to represent and store data. A key
concept of the system is the graph (or edge or relationship), which directly relates data
items in the store. The relationships allow data in the store to be linked together directly,
and in many cases retrieved with a single operation.

Example: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J,


OrientDB, Virtuoso, Stardog.

 Multi-model: Most database management systems are organized around a single data
model that determines how data can be organized, stored, and manipulated. In contrast, a
multi-model database is designed to support multiple data models against a single,
integrated backend. Document, graph, relational, and key-value models are examples of
data models that may be supported by a multi-model database.

Examples: Alchemy Database, ArangoDB, CortexDB, Couchbase, FoundationDB,


MarkLogic, OrientDB.
Handling relational data

Since most NoSQL databases lack ability for joins in queries, the database schema generally
needs to be designed differently. There are three main techniques for handling relational data in a
NoSQL database. (See table Join and ACID Support for NoSQL databases that support joins.)

Multiple queries: Instead of retrieving all the data with one query, it's common to do several
queries to get the desired data. NoSQL queries are often faster than traditional SQL queries so
the cost of having to do additional queries may be acceptable. If an excessive number of queries
would be necessary, one of the other two approaches is more appropriate.

Caching/replication/non-normalized data: Instead of only storing foreign keys, it's common to


store actual foreign values along with the model's data. For example, each blog comment might
include the username in addition to a user id, thus providing easy access to the username without
requiring another lookup. When a username changes however, this will now need to be changed
in many places in the database. Thus this approach works better when reads are much more
common than writes.

Nesting data: With document databases like MongoDB it's common to put more data in a
smaller number of collections. For example, in a blogging application, one might choose to store
comments within the blog post document so that with a single retrieval one gets all the
comments. Thus in this approach a single document contains all the data you need for a specific
task.
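A minimal sketch of this nesting approach with the MongoDB Java driver, assuming the driver is on the classpath and a local mongod is running; the database, collection and field names are invented for illustration.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.util.Arrays;

public class BlogNestingExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts = client.getDatabase("blog").getCollection("posts");

            // Nest the comments (with denormalized usernames) inside the post document,
            // so that a single retrieval returns everything the page needs.
            Document post = new Document("title", "NoSQL notes")
                    .append("body", "Why joins are rare in NoSQL...")
                    .append("comments", Arrays.asList(
                            new Document("user", "alice").append("text", "Nice summary"),
                            new Document("user", "bob").append("text", "What about ACID?")));
            posts.insertOne(post);

            // One query fetches the post together with all of its comments.
            Document fetched = posts.find(Filters.eq("title", "NoSQL notes")).first();
            System.out.println(fetched.toJson());
        }
    }
}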
Cassandra: The Apache Cassandra database is the right choice when you need scalability and
high availability without compromising performance. Linear scalability and proven fault-
tolerance on commodity hardware or cloud infrastructure make it the perfect platform for
mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-
class, providing lower latency for your users and the peace of mind of knowing that you can
survive regional outages.

Fig: 3.2

MongoDB: (from humongous) is a free and open-source cross-platform document-oriented


database program. Classified as a NoSQL database program, MongoDB uses JSON-like
documents with schemas. MongoDB is developed by MongoDB Inc. and is free and open-
source, published under a combination of the GNU Affero General Public License and the
Apache License.
Fig: 3.3

CouchDB: is open-source database software that focuses on ease of use and on having an
architecture that "completely embraces the Web". It has a document-oriented NoSQL database
architecture and is implemented in the concurrency-oriented language Erlang.

Fig: 3.4

Redis: is a software project that implements a data structure server. It is open-source, networked,
in-memory, and stores keys with optional durability.

Fig: 3.5

Riak: (pronounced "ree-ack") is a distributed NoSQL key-value data store that offers high
availability, fault tolerance, operational simplicity, and scalability. In addition to the open-source
version, it comes in a supported enterprise version and a cloud storage version. Riak implements
the principles from Amazon's Dynamo paper with heavy influence from the CAP Theorem.
Written in Erlang, Riak has fault-tolerant data replication and automatic data distribution across
the cluster for performance and resilience.
Fig: 3.6

Conclusion:
Experiment no.4

Aim: Case Study on Mining Data Stream.

Introduction:

Data streaming is the transfer of data at a steady high-speed rate sufficient to support such
applications as high-definition television (HDTV) or the continuous backup copying to a storage
medium of the data flow within a computer. Data streaming requires some combination
of bandwidth sufficiency and, for real-time human perception of the data, the ability to make
sure that enough data is being continuously received without any noticeable time lag.

Fig:4.1

Data Stream Management System:

A Data stream management system (DSMS) is a computer program to manage continuous data
streams. It is similar to a database management system (DBMS), which is, however, designed for
static data in conventional databases. A DSMS also offers a flexible query processing so that the
information need can be expressed using queries. However, in contrast to a DBMS, a DSMS
executes a continuous query that is not only performed once, but is permanently installed.
Therefore, the query is continuously executed until it is explicitly uninstalled. Since most DSMS
are data-driven, a continuous query produces new results as long as new data arrive at the
system. This basic concept is similar to Complex event processing so that both technologies are
partially coalescing.

Fig:4.2. Data Stream Management system

Big Data Streaming:


Big data streaming is a process in which big data is quickly processed in order to extract real-
time insights from it. The data on which processing is done is the data in motion. Big data
streaming is ideally a speed-focused approach wherein a continuous stream of data is processed.

Fig:4.3
Sampling Data in a Stream:
As our first example of managing streaming data, we shall look at extracting reliable samples
from a stream. As with many stream algorithms, the "trick" involves using hashing in a somewhat
unusual way.

A Motivating Example:

The general problem we shall address is selecting a subset of a stream so that we can ask queries
about the selected subset and have the answers be statistically representative of the stream as a
whole. If we know what queries are to be asked, then there are a number of methods that might
work, but we are looking for a technique that will allow ad-hoc queries on the sample. We shall
look at a particular problem, from which the general idea will emerge.

Our running example is the following. A search engine receives a stream of queries, and it would
like to study the behavior of typical users. We assume the stream consists of tuples (user, query,
time). Suppose that we want to answer queries such as "What fraction of the typical user's
queries were repeated over the past month?" Assume also that we wish to store only 1/10th of the
stream elements.

The obvious approach would be to generate a random number, say an integer from 0 to 9, in
response to each search query. Store the tuple if and only if the random number is 0. If we do so,
each user has, on average, 1/10th of their queries stored. Statistical fluctuations will introduce
some noise into the data, but if users issue many queries, the law of large numbers will assure us
that most users will have a fraction quite close to 1/10th of their queries stored.
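The sketch below implements the 1/10th sampling just described, and also shows the hashing variant hinted at in the opening of this section: hashing each user (rather than tossing a coin per query) to one of ten buckets, so that per-user questions such as repeated queries remain representative. The class and field names are invented for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class StreamSampler {

    // One element of the (user, query, time) stream.
    static class QueryEvent {
        final String user, query;
        final long time;
        QueryEvent(String user, String query, long time) {
            this.user = user; this.query = query; this.time = time;
        }
    }

    private final Random random = new Random();
    private final List<QueryEvent> sample = new ArrayList<>();

    // The "obvious" approach: keep a tuple iff a random digit 0..9 comes up 0.
    void observeRandomly(QueryEvent e) {
        if (random.nextInt(10) == 0) {
            sample.add(e);
        }
    }

    // The hashing variant: hash the USER to one of 10 buckets and keep every
    // query of the users who land in bucket 0, so each retained user's query
    // history is complete rather than a 1/10th slice.
    void observeByUserHash(QueryEvent e) {
        if (Math.floorMod(e.user.hashCode(), 10) == 0) {
            sample.add(e);
        }
    }

    public static void main(String[] args) {
        StreamSampler sampler = new StreamSampler();
        sampler.observeByUserHash(new QueryEvent("u1", "hadoop tutorial", 1L));
        sampler.observeByUserHash(new QueryEvent("u1", "hadoop tutorial", 2L));
        System.out.println("kept " + sampler.sample.size() + " tuples");
    }
}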

Filtering Streams:

Another common process on streams is selection, or filtering. We want to accept those tuples in
the stream that meet a criterion. Accepted tuples are passed to another process as a stream, while
other tuples are dropped. If the selection criterion is a property of the tuple that can be calculated
(e.g., the first component is less than 10), then the selection is easy to do. The problem becomes
harder when the criterion involves lookup for membership in a set. It is especially hard when
that set is too large to store in main memory. In this section, we shall discuss the technique
known as “Bloom filtering” as a way to eliminate most of the tuples that do not meet the
criterion.

A Motivating Example:

Again let us start with a running example that illustrates the problem and what we can do about
it. Suppose we have a set S of one billion allowed email addresses – those that we will allow
through because we believe them not to be spam. The stream consists of pairs: an email address
and the email itself. Since the typical email address is 20 bytes or more, it is not reasonable to
store S in main memory. Thus, we can either use disk accesses to determine whether or not to let
through any given stream element, or we can devise a method that requires no more main
memory than we have available, and yet will filter most of the undesired stream elements.
Suppose for argument's sake that we have one gigabyte of available main memory. In the
technique known as Bloom filtering, we use that main memory as a bit array. In this case, we
have room for eight billion bits, since one byte equals eight bits. Devise a hash function h from
email addresses to eight billion buckets. Hash each member of S to a bit, and set that bit to 1. All
other bits of the array remain 0. Since there are one billion members of S, approximately 1/8th of
the bits will be 1. The exact fraction of bits set to 1 will be slightly less than 1/8th, because it is
possible that two members of S hash to the same bit.
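Here is a toy, single-hash version of the Bloom filter just described, scaled down from eight billion bits to something that runs on a laptop. The hash function h is simply String.hashCode reduced modulo the number of bits, which is an assumption made for brevity.

import java.util.BitSet;

public class EmailBloomFilter {

    private final BitSet bits;
    private final int numBits;

    EmailBloomFilter(int numBits) {
        this.numBits = numBits;
        this.bits = new BitSet(numBits);
    }

    // Hash an address to one of numBits buckets (stand-in for the hash function h).
    private int bucket(String address) {
        return Math.floorMod(address.hashCode(), numBits);
    }

    // Add a member of the allowed set S: set its bit to 1.
    void add(String address) {
        bits.set(bucket(address));
    }

    // False means the address is definitely not in S; true may occasionally be a
    // false positive, because two members of S can hash to the same bit.
    boolean mightContain(String address) {
        return bits.get(bucket(address));
    }

    public static void main(String[] args) {
        // Toy sizes: 80 bits instead of 8 billion, two addresses instead of a billion.
        EmailBloomFilter filter = new EmailBloomFilter(80);
        filter.add("alice@example.com");
        filter.add("bob@example.com");
        System.out.println(filter.mightContain("alice@example.com")); // true
        System.out.println(filter.mightContain("spam@example.net"));  // false (most likely)
    }
}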

Decaying Windows:

We have assumed that a sliding window held a certain tail of the stream, either the most recent N
elements for fixed N, or all the elements that arrived after some time in the past. Sometimes we
do not want to make a sharp distinction between recent elements and those in the distant past, but
want to weight the recent elements more heavily. In this section, we consider "exponentially
decaying windows," and an application where they are quite useful: finding the most common
"recent" elements.

Definition of the Decaying Window:

An alternative approach is to redefine the question so that we are not asking for a count of 1's in
a window. Rather, let us compute a smooth aggregation of all the 1's ever seen in the stream,
with decaying weights: the further back in the stream an element is, the less weight it is given.
Formally, let a stream currently consist of the elements a_1, a_2, ..., a_t, where a_1 is the first
element to arrive and a_t is the current element. Let c be a small constant, such as 10^-6 or
10^-9. Define the exponentially decaying window for this stream to be the sum

\[ \sum_{i=0}^{t-1} a_{t-i} (1-c)^{i} \]

The effect of this definition is to spread out the weights of the stream elements as far back in time
as the stream goes. In contrast, a fixed window with the same sum of the weights, 1/c, would put
equal weight 1 on each of the most recent 1/c elements to arrive and weight 0 on all previous
elements.

Fig: 4.4 A decaying window and a fixed-length window of equal weight

It is much easier to adjust the sum in an exponentially decaying window than in a sliding window
of fixed length. In the sliding window, we have to worry about the element that falls out of the
window each time a new element arrives. That forces us to keep the exact elements along with
the sum, or to use an approximation scheme such as DGIM. However, when a new element a_{t+1}
arrives at the stream input, all we need to do is:

1. Multiply the current sum by (1 - c).

2. Add a_{t+1}.

The reason this method works is that each of the previous elements has now moved one position
further from the current element, so its weight is multiplied by (1 - c). Further, the weight on the
current element is (1 - c)^0 = 1, so adding a_{t+1} is the correct way to include the new element's
contribution.
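The two-step update above fits in a few lines of Java; the constant c and the toy stream in main are illustrative only.

public class DecayingWindow {

    private final double c;   // small constant, e.g. 1e-6
    private double sum = 0.0; // current value of the exponentially decaying sum

    DecayingWindow(double c) {
        this.c = c;
    }

    // When a new element a_{t+1} arrives: multiply the sum by (1 - c), then add it.
    void add(double element) {
        sum = sum * (1.0 - c) + element;
    }

    double value() {
        return sum;
    }

    public static void main(String[] args) {
        DecayingWindow window = new DecayingWindow(1e-6);
        // A stream of alternating 0's and 1's: recent 1's count almost fully,
        // while very old 1's have decayed to nearly nothing.
        for (int i = 0; i < 1_000_000; i++) {
            window.add(i % 2);
        }
        System.out.printf("decayed count of 1's = %.1f%n", window.value());
    }
}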

Conclusion:
Experiment no.5
Aim: Case Study on Frequent Itemset Algorithm

Introduction:

Frequent Itemset :

A frequent itemset is an itemset whose support is greater than some user-specified minimum
support (denoted Lk, where k is the size of the itemset). A candidate itemset is a potentially
frequent itemset (denoted Ck, where k is the size of the itemset).

Market-Basket Model:

The market-basket model of data is used to describe a common form of many-many relationship
between two kinds of objects. On the one hand, we have items, and on the other we have baskets,
sometimes called “transactions.” Each basket consists of a set of items (an itemset), and usually
we assume that the number of items in a basket is small – much smaller than the total number of
items. The number of baskets is usually assumed to be very large, bigger than what can fit in
main memory. The data is assumed to be represented in a file consisting of a sequence of
baskets.

Example:

Each set is a basket, and the words are items. We took these sets by Googling cat dog and taking
snippets from the highest-ranked pages. Do not be concerned if a word appears twice in a basket,
as baskets are sets, and in principle items can appear only once.

Also, ignore capitalization.

1. {Cat, and, dog, bites}

2. {Yahoo, news, claims, a, cat, mated, with, a, dog, and, produced, viable, offspring}

3. {Cat, killer, likely, is, a, big, dog}

4. {Professional, free, advice, on, dog, training, puppy, training}

5. {Cat, and, kitten, training, and, behaviour}

6. {Dog, &, Cat, provides, dog, training, in, Eugene, Oregon}

7. {“Dog, and, cat”, is, a, slang, term, used, by, police, officers, for, a, male–female, relationship}

8. {Shop, for, your, show, dog, grooming, and, pet, supplies}

Frequent Itemsets Mining:

With a minimum support level of 50%, the frequent itemsets are: {A}, {B}, {C}, {A, B}, {A, C}.

Application of Frequent Itemsets:

The original application of the market-basket model was in the analysis of true market baskets.
That is, supermarkets and chain stores record the contents of every market basket (physical
shopping cart) brought to the register for checkout. Here the “items” are the different products
that the store sells, and the “baskets” are the sets of items in a single market basket. A major
chain might sell 100,000 different items and collect data about millions of market baskets.

Association Rules:

While the subject of this chapter is extracting frequent sets of items from data, this information is
often presented as a collection of if–then rules, called association rules. The form of an
association rule is I → j, where I is a set of items and j is an item. The implication of this
association rule is that if all of the items in I appear in some basket, then j is “likely” to appear in
that basket as well.

Handling Larger Datasets in Main Memory:

The A-Priori Algorithm is fine as long as the step with the greatest requirement for main memory
– typically the counting of the candidate pairs C2 – has enough memory that it can be
accomplished without thrashing (repeated moving of data between disk and main memory).
Several algorithms have been proposed to cut down on the size of candidate set C2. Here, we
consider the PCY Algorithm, which takes advantage of the fact that in the first pass of A-Priori
there is typically lots of main memory not needed for the counting of single items. Then we look
at the Multistage Algorithm, which uses the PCY trick and also inserts extra passes to further
reduce the size of C2.

Algorithm of Park, Chen, and Yu:

This algorithm, which we call PCY after its authors, exploits the observation that there may be
much unused space in main memory on the first pass. If there are a million items and gigabytes
of main memory, we do not need more than 10% of the main memory for the two tables
suggested earlier: a translation table from item names to small integers and an array to count
those integers.
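A sketch of the PCY first pass along these lines: single-item counts plus a hashed pair-bucket table, followed by the bitmap of frequent buckets that prunes candidate pairs on the second pass. The baskets, bucket count and support threshold are toy values chosen for illustration.

import java.util.*;

public class PcyFirstPass {

    public static void main(String[] args) {
        // Toy baskets; in practice these would be streamed from disk.
        List<List<String>> baskets = Arrays.asList(
                Arrays.asList("bread", "milk", "beer"),
                Arrays.asList("bread", "milk"),
                Arrays.asList("milk", "beer"),
                Arrays.asList("bread", "beer"));
        int numBuckets = 11;          // would be many millions in a real run
        int supportThreshold = 2;

        Map<String, Integer> itemCounts = new HashMap<>();
        int[] bucketCounts = new int[numBuckets];

        for (List<String> basket : baskets) {
            // Count single items, exactly as in the first pass of A-Priori...
            for (String item : basket) {
                itemCounts.merge(item, 1, Integer::sum);
            }
            // ...and also hash every pair in the basket to a bucket, using the
            // main memory that A-Priori would otherwise leave idle.
            for (int i = 0; i < basket.size(); i++) {
                for (int j = i + 1; j < basket.size(); j++) {
                    String a = basket.get(i), b = basket.get(j);
                    String pairKey = a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
                    bucketCounts[Math.floorMod(pairKey.hashCode(), numBuckets)]++;
                }
            }
        }

        // Bitmap of frequent buckets: on the second pass, a pair {i, j} is counted
        // only if both items are frequent AND the pair hashes to a frequent bucket.
        BitSet frequentBuckets = new BitSet(numBuckets);
        for (int b = 0; b < numBuckets; b++) {
            if (bucketCounts[b] >= supportThreshold) {
                frequentBuckets.set(b);
            }
        }
        System.out.println("item counts: " + itemCounts);
        System.out.println("frequent buckets: " + frequentBuckets);
    }
}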

Multistage Algorithm:

The Multistage Algorithm improves upon PCY by using several successive hash tables to reduce
further the number of candidate pairs. The tradeoff is that Multistage takes more than two passes
to find the frequent pairs. An outline of the Multistage Algorithm is shown below.

Multihash Algorithm:

Sometimes, we can get most of the benefit of the extra passes of the Multistage Algorithm in a
single pass. This variation of PCY is called the Multihash Algorithm. Instead of using two
different hash tables on two successive passes, use two hash functions and two separate hash
tables that share main memory on the first pass. The danger of using two hash tables on one pass
is that each hash table has half as many buckets as the one large hash table of PCY. As long as
the average count of a bucket for PCY is much lower than the support threshold, we can operate
two half-sized hash tables and still expect most of the buckets of both hash tables to be
infrequent. Thus, in this situation we might well choose the multihash approach.
SON Algorithm and MapReduce:

The SON algorithm lends itself well to a parallel-computing environment. Each of the chunks
can be processed in parallel, and the frequent itemsets from each chunk combined to form the
candidates. We can distribute the candidates to many processors, have each processor count the
support for each candidate in a subset of the baskets, and finally sum those supports to get the
support for each candidate itemset in the whole dataset. This process does not have to be
implemented in MapReduce, but there is a natural way of expressing each of the two passes as a
MapReduce operation. We shall summarize this MapReduce-MapReduce sequence below.

First Map Function:

Take the assigned subset of the baskets and find the itemsets frequent in the subset using the
algorithm of Section 6.4.1. As described there, lower the support threshold from s to ps if each
Map task gets fraction p of the total input file. The output is a set of key-value pairs (F,1), where
F is a frequent itemset from the sample. The value is always 1 and is irrelevant.

First Reduce Function:

Each Reduce task is assigned a set of keys, which are itemsets. The value is ignored, and the
Reduce task simply produces those keys (itemsets) that appear one or more times. Thus, the
output of the first Reduce function is the candidate itemsets.

Second Map Function:

The Map tasks for the second Map function take all the output from the first Reduce Function
(the candidate itemsets) and a portion of the input data file. Each Map task counts the number of
occurrences of each of the candidate itemsets among the baskets in the portion of the dataset that
it was assigned. The output is a set of key-value pairs (C, v), where C is one of the candidate sets
and v is the support for that itemset among the baskets that were input to this Map task.

Second Reduce Function:


The Reduce tasks take the itemsets they are given as keys and sum the associated values. The
result is the total support for each of the itemsets that the Reduce task was assigned to handle.
Those itemsets whose sum of values is at least s are frequent in the whole dataset, so the Reduce
task outputs these itemsets with their counts. Itemsets that do not have total support at least s are
not transmitted to the output of the Reduce task.

Conclusion:
Experiment no.6
Aim: - Case Study on Recommendation System.

Content-Based Recommendations

There are two basic architectures for a recommendation system:


1. Content-Based systems focus on properties of items. Similarity of items
is determined by measuring the similarity in their properties.
2. Collaborative-Filtering systems focus on the relationship between users
and items. Similarity of items is determined by the similarity of the
ratings of those items by the users who have rated both items.
In this section, we focus on content-based recommendation systems. The next
section will cover collaborative filtering.

Item Profiles:-

In a content-based system, we must construct for each item a profile, which is


a record or collection of records representing important characteristics of that
item. In simple cases, the profile consists of some characteristics of the item
that are easily discovered. For example, consider the features of a movie that
might be relevant to a recommendation system.
1. The set of actors of the movie. Some viewers prefer movies with their
favorite actors.
2. The director. Some viewers have a preference for the work of certain
directors.
3. The year in which the movie was made. Some viewers prefer old movies;
others watch only the latest releases.
4. The genre or general type of movie. Some viewers like only comedies,
others dramas or romances. Genre is a vaguer concept. However, movie reviews generally assign
a genre from a set of commonly used terms. For example the Internet Movie
Database (IMDB) assigns a genre or genres to every movie.

Discovering Features of Documents:-

There are other classes of items where it is not immediately apparent what the
values of features should be. We shall consider two of them: document collections
and images. Documents present special problems, and we shall discuss
the technology for extracting features from documents in this section.
There are many kinds of documents for which a recommendation system can
be useful. For example, there are many news articles published each day, and
we cannot read all of them. A recommendation system can suggest articles on
topics a user is interested in, but how can we distinguish among topics? Web
pages are also a collection of documents. Can we suggest pages a user might
want to see? Likewise, blogs could be recommended to interested users, if we
could classify blogs by topics.
Unfortunately, these classes of documents do not tend to have readily available
information giving features. A substitute that has been useful in practice is
the identification of words that characterize the topic of a document.
First, eliminate stop words – the several hundred most common words, which tend to say little
about the topic of a document. For the remaining words, compute the TF.IDF score for
each word in the document. The ones with the highest scores are the words that characterize the
document.
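A small sketch of that computation, using a common definition of TF.IDF (term frequency normalized by the most frequent term in the document, times log2 of the number of documents over the document frequency). The three "documents" are invented and stop-word removal is omitted for brevity.

import java.util.*;

public class TfIdf {

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                "hadoop stores big data on commodity hardware".split(" "),
                "hive and pig run on top of hadoop".split(" "),
                "a recommendation system suggests news articles".split(" "));

        // Document frequency: in how many documents does each word appear?
        Map<String, Integer> docFreq = new HashMap<>();
        for (String[] doc : docs) {
            for (String word : new HashSet<>(Arrays.asList(doc))) {
                docFreq.merge(word, 1, Integer::sum);
            }
        }

        // Score the words of document 0.
        String[] doc = docs.get(0);
        Map<String, Integer> tf = new HashMap<>();
        for (String word : doc) {
            tf.merge(word, 1, Integer::sum);
        }
        int maxTf = Collections.max(tf.values());
        int n = docs.size();

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double termFreq = (double) e.getValue() / maxTf;
            double idf = Math.log((double) n / docFreq.get(e.getKey())) / Math.log(2);
            scores.put(e.getKey(), termFreq * idf);
        }

        // The highest-scoring words are the ones that characterize the document.
        scores.entrySet().stream()
              .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
              .limit(5)
              .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}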

Obtaining Item Features From Tags:-

The problem with images is that their data, typically an array of pixels, does not tell us anything
useful about their features. We can calculate simple properties of pixels, such as the average
amount of red in the picture, but few users are looking for red pictures or especially like red
pictures. There have been a number of attempts to obtain information about features of items by
inviting users to tag the items by entering words or phrases that describe the item. Thus, one
picture with a lot of red might be tagged “Tiananmen Square,” while another is tagged “sunset at
Malibu.” The distinction is not something that could be discovered by existing image-analysis
programs.

Representing Item Profiles:-

We imagined a vector of 0's and 1's, where a 1 represented the occurrence of a high-TF.IDF word
in the document. Since features for documents were all words, it was easy to represent profiles
this way.
We shall try to generalize this vector approach to all sorts of features. It is easy to do so for
features that are sets of discrete values. For example, if one feature of movies is the set of actors,
then imagine that there is a component for each actor, with 1 if the actor is in the movie, and 0 if
not. Likewise, we can have a component for each possible director, and each possible genre. All
these features can be represented using only 0's and 1's. There is another class of features that
is not readily represented by boolean vectors: those features that are numerical. For instance, we
might take the average rating for movies to be a feature, and this average is a real number. It
does not make sense to have one component for each of the possible average ratings, and doing
so would cause us to lose the structure implicit in numbers. That is, two ratings that are close but
not identical should be considered more similar than widely differing ratings. Likewise,
numerical features of products, such as screen size or disk capacity for PCs, should be
considered similar if their values do not differ greatly.
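One common way to compare such mixed boolean/numerical profiles is cosine similarity, sketched below. The feature layout and the scaling of the rating component are assumptions made for the example, since choosing that scale factor is exactly the design question raised above.

public class ItemProfiles {

    // Cosine similarity between two profile vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Components: [actor X, actor Y, director Z, comedy, drama, scaled average rating].
        // The first five are 0/1 features; the last is a numerical feature kept as a
        // (scaled) real number so that close ratings stay more similar than distant ones.
        double[] movieA = {1, 0, 1, 1, 0, 0.82};
        double[] movieB = {1, 1, 1, 1, 0, 0.79};
        double[] movieC = {0, 1, 0, 0, 1, 0.40};

        System.out.printf("sim(A,B) = %.3f%n", cosine(movieA, movieB));
        System.out.printf("sim(A,C) = %.3f%n", cosine(movieA, movieC));
    }
}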

Conclusion
Experiment no.7
Aim: Big data solution for Healthcare and Public Health Industry

Introduction:

Big data is generating a lot of hype in every industry including healthcare. As my colleagues and
I talk to leaders at health systems, we’ve learned that they’re looking for answers about big data.
They’ve heard that it’s something important and that they need to be thinking about it. But they
don’t really know what they’re supposed to do with it. So they turn to us with questions like:

 When will I need big data?


 What should I do to prepare for big data?
 What’s the best way to use big data?
 What is Health Catalyst doing with big data?

This piece will tackle such questions head-on. It’s important to separate the reality from the hype
and clearly describe the place of big data in healthcare today, along with the role it will play in
the future.

Healthcare has long focused on the individual. Only in the last 200 years have researchers begun
to pool patients into groups to investigate the health of populations. Big Data offers healthcare
the opportunity for a paradigm shift: Instead of thinking from the patient level up, there will now
be enough good data to look at whole populations to extrapolate what will happen to an
individual.

Big Data will transform approaches in healthcare that have long defined the industry. No longer
will we pool data from individuals to predict what happens at the population level. Instead,
population data will be so comprehensive that it will accurately predict what happens to an
individual patient.

Healthcare is probably one of the most data-intensive industries around. Basically, there are four
main sources generating all this healthcare data: Medical care providers, public and private
payers, ancillary service providers – from pharmacies to laboratories –, and healthcare
consumers. The challenge is not just in storage and access, but in making this data usable.

Changes infuture patient care:

Big data will create a convenient, real-time healthcare experience for patients. Insights gleaned
from that data will improve the quality and accessibility of care, and help foster a spirit of co-
operation and research between patients and providers.

A patient's electronic health record (EHR) will form the hub of patient care. Instead of
manually entering data, medical devices will automatically upload the generated data to the EHR,
adding convenience and reliability. Data can also be combined with lifestyle devices that monitor
exercise, sleep cycles, and heart rate. This gives physicians a more cohesive picture of a patient's
overall health status.
Big Data in Healthcare Today:

A number of use cases in healthcare are well suited for a big data solution. Some academic- or
research-focused healthcare institutions are either experimenting with big data or using it in
advanced research projects. Those institutions draw upon data scientists, statisticians, graduate
students, and the like to wrangle the complexities of big data. In the following sections, we’ll
address some of those complexities and what’s being done to simplify big data and make it more
accessible.

Advantages:

1: Optimizing facility performance: To put it simply, the more you know about your facility, the
better it runs. Big data technology allows facility managers to do side-by-side examinations of
performance reports in areas ranging from power monitoring to security services.

2: Reducing energy costs: This is another case where more information naturally leads to better
performance. A decrease in overall energy use comes with the benefit of decreased energy costs.
It’s also likely to create a more environmentally-conscious facility.

3: Increasing access to information: With Internet and cloud storage technologies constantly
improving and expanding, Big Data can be accessed from just about anywhere. It's also beneficial
for facility management, operators, and engineers, who can discover and deal with problems
remotely.

4: Managing with greater agility: Access to live updates and alerts allows facility managers and
hospital staff to respond quickly and efficiently to problems and concerns. With digital hospital
solutions that send the right information to the right people in real-time, users can make informed
decisions and get ahead of the issue.

5: Proactively maintaining equipment: Hospitals and healthcare facilities have a significant


physical and technological infrastructure, and in order to ensure smooth and problem-free facility
operation, it's vital to keep equipment in good working order. Big Data allows equipment
diagnostics and analytics to run in real time.

Security:

In healthcare, HIPAA compliance is non-negotiable. Nothing is more important than the privacy
and security of patient data. But, frankly, there aren’t many good, integrated ways to manage
security in big data. Although security is coming along, it has been an afterthought up to this
point. And for good reason. If a hospital only has to grant access to a couple of data scientists, it
really doesn’t have too much to worry about. But when opening up access to a large, diverse
group of users, security cannot be an afterthought.

Healthcare organizations can take some steps today to ensure better security of big data. Big data
runs on open source technology with inconsistent security technology. To avoid big problems,
organizations should be selective about big data vendors and avoid assuming that any big data
distribution they select will be secure.
Real World Example Healthcare’s Transition to Big Data

Here is a brief example of how the transition from relational databases to big data is happening
in the real world. We, at Health Catalyst, are working with one of our large health system clients
and Microsoft to create a massively parallel data warehouse in a Microsoft APS Appliance that
also includes a Hortonworks Hadoop Cluster. This means we can run a traditional relational
database and a big data cluster in parallel. We can query both data stores simultaneously, which
significantly improves our data processing power.

Conclusion:
Experiment no.8
Aim: Case Study on Mining Social Network Graph.

What is a Social Network?

When we think of a social network, we think of Facebook, Twitter, Google+, or another website
that is called a “social network,” and indeed this kind of network is representative of the broader
class of networks called “social.” The essential characteristics of a social network are: 1. There is
a collection of entities that participate in the network. Typically, these entities are people, but
they could be something else entirely. 2. There is at least one relationship between entities of the
network. On Facebook or its ilk, this relationship is called friends. Sometimes the relationship is
all-or-nothing; two people are either friends or they are not. However, in other examples of
social networks, the relationship has a degree. This degree could be discrete; e.g., friends, family,
acquaintances, or none as in Google+. It could be a real number; an example would be the
fraction of the average day that two people spend talking to each other. 3. There is an assumption
of nonrandomness or locality. This condition is the hardest to formalize, but the intuition is that
relationships tend to cluster. That is, if entity A is related to both B and C, then there is a higher
probability than average that B and C are related.

SOCIAL NETWORKS AS GRAPHS:

Social networks are naturally modeled as graphs, which we sometimes refer to as a social graph.
The entities are the nodes, and an edge connects two nodes if the nodes are related by the
relationship that characterizes the network. If there is a degree associated with the relationship,
this degree is represented by labeling the edges. Often, social graphs are undirected, as for the
Facebook friends graph. But they can be directed graphs, as for example the graphs of followers
on Twitter or Google+. Example 10.1 : Figure 10.1 is an example of a tiny social network. The
entities are the nodes A through G. The relationship, which we might think of as “friends,” is
represented by the edges. For instance, B is friends with A, C, and D. Is this graph really typical
of a social network, in the sense that it exhibits locality of relationships? First, note that the graph
has nine edges out of the $\binom{7}{2} = 21$ pairs of nodes that could have had an edge between them.
Suppose X, Y , and Z are nodes of Fig. 10.1, with edges between X and Y and also between X
and Z. What would we expect the probability of an edge between Y and Z to be? If the graph
were large, that probability would be very close to the fraction of the pairs of nodes that have
edges between them, i.e., 9/21 = .429 in this case. However, because the graph is small, there is a
noticeable difference between the true probability and the ratio of the number of edges to the
number of pairs of nodes. Since we already know there are edges (X, Y ) and (X, Z), there are
only seven edges remaining. Those seven edges could run between any of the 19 remaining pairs
of nodes. Thus, the probability of an edge (Y, Z) is 7/19 = .368.
Varieties of Social Networks:

There are many examples of social networks other than “friends” networks. Here, let us
enumerate some of the other examples of networks that also exhibit locality of relationships.

Telephone Networks: Here the nodes represent phone numbers, which are really individuals.
There is an edge between two nodes if a call has been placed between those phones in some
fixed period of time, such as last month, or “ever.” The edges could be weighted by the number
of calls made between these phones during the period. Communities in a telephone network will
form from groups of people that communicate frequently: groups of friends, members of a club,
or people working at the same company, for example.

Email Networks :The nodes represent email addresses, which are again individuals. An edge
represents the fact that there was at least one email in at least one direction between the two
addresses. Alternatively, we may only place an edge if there were emails in both directions. In
that way, we avoid viewing spammers as “friends” with all their victims. Another approach is to
label edges as weak or strong. Strong edges represent communication in both directions, while
weak edges indicate that the communication was in one direction only. The communities seen in
email networks come from the same sorts of groupings we mentioned in connection with
telephone networks. A similar sort of network involves people who text other people through
their cell phones.

Collaboration Networks: Nodes represent individuals who have published research papers.
There is an edge between two individuals who published one or more papers jointly. Optionally,
we can label edges by the number of joint publications. The communities in this network are
authors working on a particular topic. An alternative view of the same data is as a graph in which
the nodes are papers. Two papers are connected by an edge if they have at least one author in
common. Now, we form communities that are collections of papers on the same topic. There are
several other kinds of data that form two networks in a similar way. For example, we can look at
the people who edit Wikipedia articles and the articles that they edit. Two editors are connected
if they have edited an article in common. The communities are groups of editors that are
interested in the same subject. Dually, we can build a network of articles, and connect articles if
they have been edited by the same person. Here, we get communities of articles on similar or
related subjects.

CLUSTERING OF SOCIAL-NETWORK GRAPHS:


An important aspect of social networks is that they contain communities of entities that are
connected by many edges. These typically correspond to groups of friends at school or groups of
researchers interested in the same topic, for example. In this section, we shall consider clustering
of the graph as a way to identify communities. It turns out that the techniques we learned in
Chapter 7 are generally unsuitable for the problem of clustering social-network graphs.

Distance Measures for Social-Network Graphs: If we were to apply standard clustering


techniques to a social-network graph, our first step would be to define a distance measure. When
the edges of the graph have labels, these labels might be usable as a distance measure, depending
on what they represented. But when the edges are unlabeled, as in a “friends” graph, there is not
much we can do to define a suitable distance.

Our first instinct is to assume that nodes are close if they have an edge between them and distant
if not. Thus, we could say that the distance d(x, y) is 0 if there is an edge (x, y) and 1 if there is
no such edge. We could use any other two values, such as 1 and ∞, as long as the distance is
closer when there is an edge.

Applying Standard Clustering Methods:

There are two general approaches to clustering: hierarchical (agglomerative) and point-
assignment. Let us consider how each of these would work on a social-network graph. First,
consider the hierarchical methods. In particular, suppose we use as the intercluster distance the
minimum distance between nodes of the two clusters. Hierarchical clustering of a social-network
graph starts by combining some two nodes that are connected by an edge. Successively, edges
that are not between two nodes of the same cluster would be chosen randomly to combine the
clusters to which their two nodes belong. The choices would be random, because all distances
represented by an edge are the same.

Betweenness: Since there are problems with standard clustering methods, several specialized
clustering techniques have been developed to find communities in social networks. In this
section we shall consider one of the simplest, based on finding the edges that are least likely to
be inside a community. Define the betweenness of an edge (a, b) to be the number of pairs of
nodes x and y such that the edge (a, b) lies on the shortest path between x and y. To be more
precise, since there can be several shortest paths between x and y, edge (a, b) is credited with the
fraction of those shortest paths that include the edge (a, b). As in golf, a high score is bad. It
suggests that the edge (a, b) runs between two different communities; that is, a and b do not
belong to the same community.

The Girvan-Newman Algorithm: In order to exploit the betweenness of edges, we need to


calculate the number of shortest paths going through each edge. We shall describe a method
called the Girvan-Newman (GN) Algorithm, which visits each node X once and computes the
number of shortest paths from X to each of the other nodes that go through each of the edges.
The algorithm begins by performing a breadth-first search (BFS) of the graph, starting at the
node X. Note that the level of each node in the BFS presentation is the length of the shortest path
from X to that node. Thus, the edges that go between nodes at the same level can never be part of
a shortest path from X. Edges between levels are called DAG edges (“DAG” stands for directed,
acyclic graph). Each DAG edge will be part of at least one shortest path from root X. If there is a
DAG edge (Y, Z), where Y is at the level above Z (i.e., closer to the root), then we shall call Y a
parent of Z and Z a child of Y , although parents are not necessarily unique in a DAG as they
would be in a tree.

Figure 10.4 is a breadth-first presentation of the graph of Fig. 10.3, starting at node E. Solid
edges are DAG edges and dashed edges connect nodes at the same level. The second step of
the GN algorithm is to label each node by the number of shortest paths that reach it from the
root. Start by labeling the root 1. Then, from the top down, label each node Y by the sum of the
labels of its parents.
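The sketch below carries out these first two steps (BFS levels and shortest-path counts from a chosen root) on a small graph consistent with the description of Fig. 10.1 earlier in this experiment; the exact edge set beyond what the text states is assumed for illustration, and the later credit-assignment step of Girvan-Newman is not shown.

import java.util.*;

public class GirvanNewmanBfs {

    public static void main(String[] args) {
        // Undirected graph with nodes A..G, nine edges, B adjacent to A, C and D
        // (matching the description of Fig. 10.1; the remaining edges are assumed).
        Map<String, List<String>> graph = new HashMap<>();
        addEdge(graph, "A", "B"); addEdge(graph, "A", "C"); addEdge(graph, "B", "C");
        addEdge(graph, "B", "D"); addEdge(graph, "D", "E"); addEdge(graph, "D", "F");
        addEdge(graph, "D", "G"); addEdge(graph, "E", "F"); addEdge(graph, "F", "G");

        String root = "E";
        Map<String, Integer> level = new HashMap<>();      // BFS level = shortest distance from root
        Map<String, Integer> pathCount = new HashMap<>();  // number of shortest paths from root
        level.put(root, 0);
        pathCount.put(root, 1);

        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (String next : graph.getOrDefault(node, Collections.emptyList())) {
                if (!level.containsKey(next)) {
                    // First time seen: it is a child one level further from the root.
                    level.put(next, level.get(node) + 1);
                    pathCount.put(next, pathCount.get(node));
                    queue.add(next);
                } else if (level.get(next) == level.get(node) + 1) {
                    // Another DAG edge into an already-seen child: add this parent's count.
                    pathCount.merge(next, pathCount.get(node), Integer::sum);
                }
                // Edges between nodes at the same level are ignored: they can never
                // be part of a shortest path from the root.
            }
        }
        System.out.println("levels:      " + level);
        System.out.println("path counts: " + pathCount);
    }

    static void addEdge(Map<String, List<String>> g, String a, String b) {
        g.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        g.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }
}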

Using Betweenness to Find Communities: The betweenness scores for the edges of a graph
behave something like a distance measure on the nodes of the graph. It is not exactly a distance
measure, because it is not defined for pairs of nodes that are unconnected by an edge, and might
not satisfy the triangle inequality even when defined. However, we can cluster by taking the
edges in order of increasing betweenness and add them to the graph one at a time. At each step,
the connected components of the graph form some clusters. The higher the betweenness we
allow, the more edges we get, and the larger the clusters become. More commonly, this idea is
expressed as a process of edge removal. Start with the graph and all its edges; then remove edges
with the highest betweenness, until the graph has broken into a suitable number of connected
components.

Direct Discovery of Communities: In the previous section we searched for communities by


partitioning all the individuals in a social network. While this approach is relatively efficient, it
does have several limitations. It is not possible to place an individual in two different
communities, and everyone is assigned to a community. In this section, we shall see a technique
for discovering communities directly by looking for subsets of the nodes that have a relatively
large number of edges among them. Interestingly, the technique for doing this search on a large
graph involves finding large frequent itemsets.

Finding Cliques: Our first thought about how we could find sets of nodes with many edges
between them is to start by finding a large clique (a set of nodes with edges between any two of
them). However, that task is not easy. Not only is finding maximal cliques NP-complete, but it is
among the hardest of the NP-complete problems in the sense that even approximating the
maximal clique is hard. Further, it is possible to have a set of nodes with almost all edges
between them, and yet have only relatively small cliques.

Complete Bipartite Graphs: A complete bipartite graph consists of s nodes on one side and t
nodes on the other side, with all st possible edges between the nodes of one side and the other
present. We denote this graph by Ks,t. You should draw an analogy between complete bipartite
graphs as subgraphs of general bipartite graphs and cliques as subgraphs of general graphs. In
fact, a clique of s nodes is often referred to as a complete graph and denoted Ks, while a
complete bipartite subgraph is sometimes called a bi-clique.

SIMRANK:

In this section, we shall take up another approach to analyzing social-network graphs. This
technique, called “simrank,” applies best to graphs with several types of nodes, although it can in
principle be applied to any graph. The purpose of simrank is to measure the similarity between
nodes of the same type, and it does so by seeing where random walkers on the graph wind up
when starting at a particular node.

Random Walkers on a Social Graph: We can similarly think of a person “walking” on a


social network. The graph of a social network is generally undirected, while the Web graph is
directed. However, the difference is unimportant. A walker at a node N of an undirected graph
will move with equal probability to any of the neighbors of N (those nodes with which N shares
an edge). Suppose, for example, that such a walker starts out at node T1 of Fig. 10.2, which we
reproduce here as Fig. 10.21. At the first step, it would go either to U1 or W1. If to W1, then it
would next either come back to T1 or go to T2. If the walker first moved to U1, it would wind up
at either T1, T2, or T3 next. We conclude that, starting at T1, there is a good chance the walker
would visit T2, at least initially, and that chance is better than the chance it would visit T3 or T4.
It would be interesting if we could infer that tags T1 and T2 are therefore related or similar in
some way. The evidence is that they have both been placed on a
common Web page, W1, and they have also been used by a common tagger, U1. However, if we
allow the walker to continue traversing the graph at random, then the probability that the walker
will be at any particular node does not depend on where it starts out. This conclusion comes from
the theory of Markov processes that we mentioned in Section 5.1.2, although the independence
from the starting point requires additional conditions besides connectedness that the graph of Fig.
10.21 does satisfy.

Random Walks with Restart :We see from the observations above that it is not possible to
measure similarity to a particular node by looking at the limiting distribution of the walker.
However, we have already seen, in Section 5.1.5, the introduction of a small probability that the
walker will stop walking at random. Later, we saw in Section 5.3.2 that there were reasons to
select only a subset of Web pages as the teleport set, the pages that the walker would go to when
they stopped surfing the Web at random. Here, we take this idea to the extreme. As we are
focused on one particular node N of a social network, and want to see where the random walker
winds up on short walks from that node, we modify the matrix of transition probabilities to have
a small additional probability of transitioning to N from any node. Formally, let M be the
transition matrix of the graph G. That is, the entry in row i and column j of M is 1/k if node j of
G has degree k, and one of the adjacent nodes is i. Otherwise, this entry is 0. We shall discuss
teleporting in a moment, but first, let us look at a simple example of a transition matrix.
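Since the promised transition-matrix example is not reproduced in this excerpt, here is a hedged sketch on a small made-up graph. It builds the column-stochastic matrix M described above and iterates the random walk with restart, v <- beta*M*v + (1 - beta)*e_N, where 1 - beta is the small probability of jumping back to the starting node N; the graph, beta and the iteration count are all illustrative choices.

import java.util.Arrays;

public class RandomWalkWithRestart {

    public static void main(String[] args) {
        // Toy undirected graph on 4 nodes: edges 0-1, 0-2, 1-2, 1-3.
        int[][] adj = {
                {0, 1, 1, 0},
                {1, 0, 1, 1},
                {1, 1, 0, 0},
                {0, 1, 0, 0}
        };
        int n = adj.length;
        int startNode = 0;   // the node N whose "similar" nodes we want to rank
        double beta = 0.8;   // 1 - beta is the restart (teleport-to-N) probability

        // Transition matrix: M[i][j] = 1/k if node j has degree k and i is adjacent to j.
        double[][] m = new double[n][n];
        for (int j = 0; j < n; j++) {
            int degree = 0;
            for (int i = 0; i < n; i++) degree += adj[i][j];
            for (int i = 0; i < n; i++) m[i][j] = degree == 0 ? 0.0 : (double) adj[i][j] / degree;
        }

        // Start the walker at N and iterate until the distribution settles.
        double[] v = new double[n];
        v[startNode] = 1.0;
        for (int step = 0; step < 100; step++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[i] += beta * m[i][j] * v[j];
                }
            }
            next[startNode] += 1.0 - beta;   // restart mass flows back to N
            v = next;
        }
        // Higher probability means the node is "closer" to N in the simrank sense.
        System.out.println(Arrays.toString(v));
    }
}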

E(X, Y) ⋈ E(X, Z) ⋈ E(Y, Z)

To understand this join, we have to recognize that the attributes of the relation E are given
different names in each of the three uses of E. That is, we imagine there are three copies of E,
each with the same tuples, but with different schemas. In SQL, this join would be written using
a single relation E(A, B) as follows:

SELECT e1.A, e1.B, e2.B

FROM E e1, E e2, E e3

WHERE e1.A = e2.A AND e1.B = e3.A AND e2.B = e3.B

In this query, the equated attributes e1.A and e2.A are represented in our join by the attribute X.
Also, e1.B and e3.A are each represented by Y; e2.B and e3.B are represented by Z. Notice that
each triangle appears once in this join. The triangle consisting of nodes v1, v2, and v3 is
generated when X, Y, and Z are these three nodes in numerical order, i.e., X < Y < Z. For
instance, if the numerical order of the nodes is v1 < v2 < v3, then X can only be v1, Y is v2, and
Z is v3. Now consider the number of ways in which the values of each attribute should be
hashed. In the present example, the matter is quite simple. The three occurrences of relation E
surely have the same size, so by symmetry, attributes X, Y, and Z will each be hashed to the
same number of buckets. That is, the minimum communication cost is O(mb) if we use b^3
Reduce tasks. Next, let us compute the total execution cost at all the Reduce tasks. Assume that
the hash function distributes edges sufficiently randomly that the Reduce tasks each get
approximately the same number of edges. Since the total number of edges distributed to the b^3
Reduce tasks is O(mb), it follows that each task receives O(m/b^2) edges. If we use the algorithm
of Section 10.7.2 at each Reduce task, the total computation at a task is O((m/b^2)^(3/2)), or
O(m^(3/2)/b^3). Since there are b^3 Reduce tasks, the total computation cost is O(m^(3/2)).

CONCLUSION:
