Unit 2


Apache Hadoop is composed of several core components that work together
to enable distributed storage and processing of large datasets. These core
components include:

1. Hadoop Distributed File System (HDFS): HDFS is a distributed file system
designed to store large files across multiple machines in a Hadoop cluster. It
provides high availability, fault tolerance, and scalability by splitting large files
into smaller blocks and distributing them across the cluster's nodes. HDFS
forms the storage layer of Hadoop.

2. Yet Another Resource Negotiator (YARN): YARN is the resource
management layer of Hadoop. It is responsible for managing and allocating
resources (CPU, memory, etc.) across applications running in the cluster.
YARN separates resource management and job scheduling from the data
processing itself, allowing for more flexibility and scalability in deploying
different processing frameworks.

3. MapReduce: MapReduce is a programming model and processing engine
for large-scale data processing in Hadoop. It consists of two main phases: the
Map phase, where data is processed in parallel across multiple nodes, and the
Reduce phase, where the results of the Map phase are aggregated and
combined to produce the final output.

4. Hadoop Common: Hadoop Common includes libraries and utilities shared
by all Hadoop modules. It provides the necessary infrastructure for running
Hadoop applications, including tools for configuration management, security,
and logging.

These core components work together to enable distributed storage, resource
management, and processing of large datasets in Hadoop clusters.
Additionally, Hadoop ecosystem projects build upon these core components
to provide additional functionalities such as data querying (Apache Hive,
Apache Pig), real-time processing (Apache Spark, Apache Flink), and data
ingestion (Apache Kafka, Apache Flume).

Hadoop Ecosystem



Overview: Apache Hadoop is an open-source framework intended to make working
with big data easier. For those not acquainted with this technology, one question
arises: what is big data? Big data is the term for data sets that cannot be
processed efficiently with traditional methods such as an RDBMS. Hadoop has
made its place in industries and companies that need to work on large, sensitive
data sets that require efficient handling. Hadoop is a framework that enables
processing of large data sets stored across clusters of machines. Being a
framework, Hadoop is made up of several modules that are supported by a large
ecosystem of technologies.

Introduction: The Hadoop Ecosystem is a platform or suite that provides various
services to solve big data problems. It includes Apache projects and various
commercial tools and solutions. There are four major elements of Hadoop: HDFS,
MapReduce, YARN, and Hadoop Common utilities. Most of the tools or solutions
supplement or support these major elements. All these tools work collectively to
provide services such as ingestion, analysis, storage, and maintenance of data.

Following are the components that collectively form a Hadoop ecosystem:

 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLlib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 ZooKeeper: Managing cluster
 Oozie: Job Scheduling
Note: Apart from the above-mentioned components, there are many other components
too that are part of the Hadoop ecosystem.

All these toolkits or components revolve around one thing: data. That is the beauty
of Hadoop: because everything revolves around data, working with it becomes easier.

HDFS:

 HDFS is the primary component of the Hadoop ecosystem. It is responsible for
storing large data sets of structured or unstructured data across various nodes,
and it maintains the metadata in the form of log files.
 HDFS consists of two core components:
1. Name Node
2. Data Node
 The Name Node is the prime node. It holds the metadata (data about data) and
requires comparatively fewer resources than the Data Nodes, which store the
actual data. These Data Nodes run on commodity hardware in the distributed
environment, which makes Hadoop cost effective.
 HDFS maintains all the coordination between the clusters and hardware,
thus working at the heart of the system.
YARN:

 Yet Another Resource Negotiator: as the name implies, YARN helps manage
resources across the cluster. In short, it performs scheduling and resource
allocation for the Hadoop system.
 It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
 The Resource Manager has the privilege of allocating resources to the
applications in the system, whereas the Node Managers handle the allocation of
resources such as CPU, memory, and bandwidth on each machine and report back
to the Resource Manager. The Application Manager works as an interface
between the Resource Manager and the Node Managers and negotiates resources
as required by the two.

MapReduce:

 By making use of distributed and parallel algorithms, MapReduce makes it
possible to carry the processing logic over to the nodes that hold the data, and
helps to write applications that transform big data sets into manageable ones.
 MapReduce uses two functions, Map() and Reduce(), whose tasks are:

1. Map() performs sorting and filtering of the data and thereby organizes
it into groups. Map() generates key-value pair results, which are later
processed by the Reduce() method.
2. Reduce(), as the name suggests, performs summarization by
aggregating the mapped data. Put simply, Reduce() takes the
output generated by Map() as input and combines those tuples
into a smaller set of tuples.
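
The division of labour between Map() and Reduce() can be illustrated with a word count. This is only a single-process sketch in plain Python, not Hadoop code: the function names map_fn, shuffle, reduce_fn, and word_count are made up for this example, and the grouping step stands in for the work the framework does between the two functions.

```python
from collections import defaultdict

def map_fn(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_fn(key, values):
    """Reduce(): combine all values for one key into a single tuple."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_fn(line)]
    return dict(reduce_fn(key, values) for key, values in shuffle(mapped).items())

# Hypothetical two-line input, chosen only for illustration.
counts = word_count(["big data needs big clusters", "data drives decisions"])
print(counts)
```

Nine input words become six output tuples, which is the "smaller set of tuples" that Reduce() is meant to produce.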

PIG:

Pig was originally developed at Yahoo. It works with the Pig Latin language, a
query-based language similar to SQL.
 It is a platform for structuring the data flow and for processing and analyzing
huge data sets.
 Pig executes the commands, and in the background all the MapReduce
activities are taken care of. After processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on
the Pig Runtime, just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a
major segment of the Hadoop Ecosystem.

HIVE:

 With the help of an SQL methodology and interface, HIVE performs reading
and writing of large data sets. Its query language is called HQL
(Hive Query Language).
 It is highly scalable and supports batch processing as well as interactive
queries, although it is not designed for low-latency, real-time workloads. Most
common SQL data types are supported by Hive, which makes query processing easier.
 Like other query processing frameworks, HIVE comes with two
components: JDBC drivers and the HIVE command line.
 The JDBC and ODBC drivers establish the connection and the data storage
permissions, whereas the HIVE command line helps in the
processing of queries.
Hadoop – Pros and Cons



Big Data has become necessary as industries grow. The goal is to gather
information and find the hidden facts behind the data. Data defines how industries
can improve their activities and affairs. A large number of industries revolve
around data, and large amounts of it are gathered and analyzed through various
processes with various tools. Hadoop is one of the tools for dealing with this
huge amount of data, as it can readily extract information from it. Hadoop has
its advantages and disadvantages when we deal with Big Data.
Pros

1. Cost Hadoop is open-source and uses cost-effective commodity hardware, which
provides a cost-efficient model, unlike traditional relational databases that require
expensive hardware and high-end processors to deal with Big Data. The problem
with traditional relational databases is that storing a massive volume of data is not
cost-effective, so companies started to discard the raw data, which may leave them
without a correct picture of their business. Hadoop therefore provides two main
cost benefits: it is open-source, meaning free to use, and it runs on inexpensive
commodity hardware.
2. Scalability Hadoop is a highly scalable model. A large amount of data is divided
across multiple inexpensive machines in a cluster and processed in parallel. The
number of these machines or nodes can be increased or decreased as per the
enterprise's requirements. A traditional RDBMS (Relational Database Management
System) cannot easily be scaled out to handle such large amounts of data.
3. Flexibility Hadoop is designed to deal with any kind of dataset: structured
(MySQL data), semi-structured (XML, JSON), and unstructured (images and videos).
It can process data independent of its structure, which makes it highly flexible.
This is very useful for enterprises, as they can process large datasets easily, and
businesses can use Hadoop to derive valuable insights from sources like social
media, email, etc. With this flexibility, Hadoop can be used for log processing,
data warehousing, fraud detection, etc.
4. Speed Hadoop uses a distributed file system, HDFS (Hadoop Distributed File
System), to manage its storage. In a DFS (Distributed File System), a large file is
broken into small file blocks that are distributed among the nodes available in a
Hadoop cluster. Because this massive number of file blocks is processed in
parallel, Hadoop is faster and provides high performance compared to traditional
database management systems. When you are dealing with a large amount of
unstructured data, speed is an important factor; with Hadoop you can access
terabytes of data in just a few minutes.
5. Fault Tolerance Hadoop uses commodity hardware (inexpensive systems), which
can crash at any moment. In Hadoop, data is replicated on various DataNodes in
a Hadoop cluster, which ensures the availability of data if any of your systems
crashes. If the machine you are reading from faces a technical issue, the data can
be read from other nodes in the Hadoop cluster, because the data is replicated by
default. Hadoop makes 3 copies of each file block and stores them on different
nodes.
6. High Throughput Hadoop works on a distributed file system where various jobs
are assigned to various DataNodes in a cluster. The data is processed in parallel
across the Hadoop cluster, which produces high throughput. Throughput is
the amount of work done per unit time.
7. Minimum Network Traffic In Hadoop, each task is divided into various small
sub-tasks, which are then assigned to the data nodes available in the Hadoop cluster.
Each data node processes a small amount of data, which leads to low traffic in a
Hadoop cluster.
Cons

1. Problem with Small Files Hadoop performs efficiently with a small number of
large files. Hadoop stores a file in the form of file blocks, which are 128 MB in
size by default and are often configured to 256 MB. Hadoop struggles when it needs
to access a large number of small files. So many small files overload the NameNode
and make it difficult to work.
2. Vulnerability Hadoop is a framework written in Java, one of the most commonly
used programming languages, which makes it a more attractive target that
cyber-criminals may try to exploit.
3. Low Performance in Small Data Surroundings Hadoop is mainly designed for
dealing with large datasets, so it can be efficiently utilized by organizations that
generate a massive volume of data. Its efficiency decreases in small data
surroundings.
4. Lack of Security Data is everything for an organization, yet by default the security
features in Hadoop are disabled. Whoever manages the data needs to be careful
about this security aspect and take appropriate action. Hadoop uses Kerberos for
security, which is not easy to manage. Kerberos itself does not provide storage or
network encryption, which is a further concern.
5. High Up Processing Read/write operations in Hadoop are expensive, since we are
dealing with large data in the range of TB or PB. In Hadoop, data is read from and
written to disk, which makes in-memory calculation difficult and leads to
processing overhead, or high up processing.
6. Supports Only Batch Processing A batch process is a process that runs in the
background and does not have any kind of interaction with the user. The engines
used for these processes inside the Hadoop core are not very efficient, and
producing output with low latency is not possible with them.

What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed
over several machines and replicated to ensure durability against failure and high
availability to parallel applications.

It is cost effective as it uses commodity hardware. It involves the concepts of blocks,
data nodes, and the name node.

Where to use HDFS


o Very Large Files: Files should be hundreds of megabytes, gigabytes, or more.
o Streaming Data Access: The time to read the whole data set is more important than
the latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
o Commodity Hardware: It works on low-cost hardware.

Where not to use HDFS


o Low Latency Data Access: Applications that need very fast access to the first
record should not use HDFS, as it gives importance to the whole data set rather than
the time to fetch the first record.
o Lots of Small Files: The name node holds the metadata of files in memory, and if
there are many small files, this takes up a lot of the name node's memory, which is
not feasible.
o Multiple Writes: It should not be used when we have to write multiple times.
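
A rough calculation shows why lots of small files strain the name node. This sketch assumes one in-memory object per file plus one per block, and it uses a figure of about 150 bytes per object. That figure is a commonly quoted rule of thumb, not a value taken from these notes, so treat the byte totals as illustrative only. The function names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # default HDFS block size in bytes
BYTES_PER_OBJECT = 150       # ASSUMPTION: rule-of-thumb estimate, not a measured value

def namenode_objects(file_sizes):
    """Count namespace objects: one per file plus one per block."""
    blocks = sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)
    return len(file_sizes) + blocks

def estimated_heap_bytes(file_sizes):
    """Very rough name node memory estimate for the given files."""
    return namenode_objects(file_sizes) * BYTES_PER_OBJECT

large = [1024**3]            # 1 GiB stored as a single file
small = [1024**2] * 1024     # the same 1 GiB stored as 1,024 files of 1 MiB

print(namenode_objects(large), estimated_heap_bytes(large))
print(namenode_objects(small), estimated_heap_bytes(small))
```

Under these assumptions the same amount of data needs 9 objects as one large file but 2,048 objects as small files, so the metadata footprint grows with the number of files rather than with the volume of data.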


HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks
are 128 MB by default, and this is configurable. Files in HDFS are broken into
block-sized chunks, which are stored as independent units. Unlike in an ordinary file
system, a file in HDFS that is smaller than the block size does not occupy a full
block's size, i.e. a 5 MB file stored in HDFS with a block size of 128 MB takes only
5 MB of space. The HDFS block size is large just to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as the
master. The name node is the controller and manager of HDFS, as it knows the status and the
metadata of all the files in HDFS; the metadata includes file permissions,
names, and the location of each block. The metadata is small, so it is stored in the
memory of the name node, allowing faster access to data. Moreover, the HDFS cluster is
accessed by multiple clients concurrently, so all this information is handled by a single
machine. File system operations like opening, closing, renaming, etc. are executed
by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by a client or the name
node. They report back to the name node periodically with the list of blocks that they are
storing. A data node, being commodity hardware, also does the work of block
creation, deletion, and replication as instructed by the name node.
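
The block and replication ideas can be sketched in a few lines of Python. The helper names and the round-robin placement are assumptions made for illustration; real HDFS placement also takes racks and node load into account, which this toy model ignores. The replication factor of 3 is the default mentioned in the Pros section.

```python
BLOCK_SIZE_MB = 128   # default block size stated in the text
REPLICATION = 3       # default number of copies per block

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block; the last block only holds what is left."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Toy round-robin placement: each block lands on `replication` distinct nodes.

    Assumes len(data_nodes) >= replication.
    """
    return {
        block: [data_nodes[(block + i) % len(data_nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

blocks = split_into_blocks(300)      # a hypothetical 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(placement)
print(split_into_blocks(5))          # the 5 MB file from the text
```

The 5 MB file yields a single 5 MB block, matching the statement that a small file does not occupy a full 128 MB block.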

HDFS DataNode and NameNode Image:

HDFS Read Image:


HDFS Write Image:

Since all the metadata is stored in the name node, it is very important. If it fails, the file
system cannot be used, as there would be no way of knowing how to reconstruct the
files from the blocks present in the data nodes. To overcome this, the concept of the
secondary name node arises.
Secondary Name Node: It is a separate physical machine that acts as a helper to the
name node. It performs periodic checkpoints: it communicates with the name node
and takes snapshots of the metadata, which helps minimize downtime and loss of data.


Processing Data with Hadoop


Hadoop is a popular open-source framework for distributed storage and processing of large data sets
using a cluster of commodity hardware. It is designed to scale from a single server to thousands of
machines, providing a reliable and efficient way to store and process vast amounts of data. Hadoop is
based on the MapReduce programming model, where data processing tasks are divided into smaller
sub-tasks and distributed across a cluster for parallel execution.

Here's a general overview of how data processing works with Hadoop:

1. **Storage:**

- **Hadoop Distributed File System (HDFS):** Hadoop stores data in a distributed file system called
HDFS. Data is divided into blocks and distributed across multiple nodes in the cluster for fault
tolerance and parallel processing.

2. **Processing Model:**

- **MapReduce:** The core processing model in Hadoop is MapReduce. It consists of two main
phases - the Map phase and the Reduce phase.

- **Map Phase:** Input data is divided into smaller chunks, and a "mapper" task processes each
chunk independently, generating key-value pairs as output.

- **Shuffle and Sort:** The framework then shuffles and sorts the intermediate key-value pairs,
grouping them by key.

- **Reduce Phase:** The "reducer" tasks process the sorted key-value pairs, aggregating and
producing the final output.

3. **Programming Model:**

- **MapReduce API:** Developers can write MapReduce programs using the Hadoop MapReduce
API, typically in Java. However, there are also other higher-level abstractions and languages available,
such as Apache Pig, Apache Hive, and Apache Spark, which simplify the development process.

4. **Job Submission:**
- **Hadoop Job Submission:** Once the MapReduce program is written, it needs to be packaged
into a JAR file and submitted to the Hadoop cluster. In Hadoop 1.x the JobTracker manages the
execution of the job across the cluster; from Hadoop 2.0 onward, YARN takes over this role.

5. **Data Processing Tools:**

- **Apache Hive:** Provides a SQL-like interface for querying and managing large datasets stored in
Hadoop.

- **Apache Pig:** A high-level scripting language for expressing data analysis programs that are
translated into MapReduce jobs.

- **Apache Spark:** An open-source, distributed computing system that supports in-memory
processing and offers a more flexible programming model than MapReduce.

6. **Monitoring and Management:**

- **Hadoop Ecosystem Tools:** Various tools such as Apache Ambari, Apache Hadoop YARN, and
Apache Hadoop MapReduce provide monitoring, resource management, and job tracking
capabilities.

7. **Integration with other systems:**

- **Data Ingestion:** Hadoop can ingest data from various sources, and tools like Apache Flume or
Apache Kafka can be used for efficient data ingestion.

- **Data Export:** Data processed in Hadoop can be exported to other systems or databases for
further analysis or reporting.

8. **Scaling:**

- **Horizontal Scaling:** Hadoop allows for easy scaling by adding more nodes to the cluster,
providing the ability to handle increasing amounts of data and processing requirements.
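
The Map, Shuffle and Sort, and Reduce phases described in step 2 can be imitated locally. The sketch below is written in the style of Hadoop Streaming, where mappers and reducers exchange tab-separated key-value lines, but it works on in-memory lists instead of standard input. The record format ("city,temperature") and the data are hypothetical.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: turn each 'city,temperature' record into 'city<TAB>temperature'."""
    for line in lines:
        city, temp = line.strip().split(",")
        yield f"{city}\t{temp}"

def reducer(sorted_lines):
    """Reduce phase: input arrives sorted by key, so equal keys are adjacent."""
    parsed = (line.split("\t") for line in sorted_lines)
    for city, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{city}\t{max(int(temp) for _, temp in group)}"

records = ["pune,31", "delhi,40", "pune,35", "delhi,38", "goa,29"]  # hypothetical input
intermediate = sorted(mapper(records))   # stands in for the shuffle-and-sort step
result = list(reducer(intermediate))
print(result)
```

Because the reducer only relies on its input being sorted by key, it never needs to hold more than one key's values at a time, which is the property the shuffle-and-sort step provides on a real cluster.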


Hadoop YARN Architecture


Last Updated : 24 Apr, 2023



YARN stands for “Yet Another Resource Negotiator”. It was introduced in
Hadoop 2.0 to remove the bottleneck of the Job Tracker, which was present in Hadoop
1.0. YARN was described as a “Redesigned Resource Manager” at the time of its
launch, but it has now evolved to be known as a large-scale distributed operating
system used for Big Data processing.

The YARN architecture basically separates the resource management layer from the
processing layer. The responsibility held by the Job Tracker in Hadoop 1.0 is now split
between the Resource Manager and a per-application Application Master.
YARN also allows different data processing engines, such as graph processing,
interactive processing, stream processing, and batch processing, to run and
process data stored in HDFS (Hadoop Distributed File System), thus making the
system much more efficient. Through its various components, it can dynamically
allocate various resources and schedule the application processing. For large-volume
data processing, it is necessary to manage the available resources
properly so that every application can leverage them.
YARN Features: YARN gained popularity because of the following features:

 Scalability: The scheduler in the Resource Manager of the YARN architecture
allows Hadoop to extend to and manage thousands of nodes and clusters.
 Compatibility: YARN supports existing map-reduce applications
without disruption, making it compatible with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports dynamic utilization of the cluster
in Hadoop, which enables optimized cluster utilization.
 Multi-tenancy: It allows multiple engines to access the cluster, giving
organizations the benefit of multi-tenancy.

Hadoop YARN Architecture


The main components of YARN architecture include:

 Client: It submits map-reduce jobs.
 Resource Manager: It is the master daemon of YARN and is responsible
for resource assignment and management among all the applications.
Whenever it receives a processing request, it forwards it to the
corresponding node manager and allocates resources for the completion
of the request accordingly. It has two major components:
 Scheduler: It performs scheduling based on the allocated
application and available resources. It is a pure scheduler,
meaning it does not perform other tasks such as monitoring or
tracking and does not guarantee a restart if a task fails. The
YARN scheduler supports plugins such as Capacity Scheduler
and Fair Scheduler to partition the cluster resources.
 Application manager: It is responsible for accepting the
application and negotiating the first container from the
resource manager. It also restarts the Application Master
container if it fails.
 Node Manager: It takes care of an individual node in the Hadoop cluster and
manages the applications and workflow on that particular node. Its primary
job is to keep up with the Resource Manager. It registers with the
Resource Manager and sends heartbeats with the health status of the
node. It monitors resource usage, performs log management, and
kills containers based on directions from the Resource Manager. It is also
responsible for creating the container process and starting it at the request
of the Application Master.
 Application Master: An application is a single job submitted to a
framework. The Application Master is responsible for negotiating
resources with the Resource Manager and for tracking the status and monitoring
the progress of a single application. The Application Master requests the
container from the Node Manager by sending a Container Launch
Context (CLC), which includes everything an application needs to run.
Once the application is started, it sends a health report to the Resource
Manager from time to time.
 Container: It is a collection of physical resources such as RAM, CPU
cores, and disk on a single node. Containers are launched through a Container
Launch Context (CLC), a record that contains information such
as environment variables, security tokens, dependencies, etc.
Application workflow in Hadoop YARN:

1. Client submits an application
2. The Resource Manager allocates a container to start the Application
Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource
Manager
5. The Application Master notifies the Node Manager to launch
containers
6. Application code is executed in the container
7. Client contacts Resource Manager/Application Master to monitor
application’s status
8. Once the processing is complete, the Application Master un-registers
with the Resource Manager
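
The workflow can be traced with a toy model. Everything below is a hypothetical simplification written for these notes: the ResourceManager class, its methods, the first-fit allocation, and the 512 MB reserved for the per-application master in step 2 are assumptions, and none of it corresponds to the actual YARN API.

```python
class ResourceManager:
    """Toy stand-in for the YARN Resource Manager: tracks free memory per node."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)
        self.registered = set()

    def allocate(self, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] = free - memory_mb
                return {"node": node, "memory_mb": memory_mb}
        return None  # no capacity; a real scheduler would queue the request

    def release(self, container):
        self.free[container["node"]] += container["memory_mb"]


def run_application(rm, app_id, task_memory_mb):
    """Walk through the numbered steps for one application; return an event log."""
    log = [f"client submits {app_id}"]                        # step 1
    master = rm.allocate(512)                                 # step 2 (size is arbitrary)
    if master is None:
        return log + ["rejected: no capacity for the master"]
    rm.registered.add(app_id)                                 # step 3
    log.append(f"master registered on {master['node']}")
    granted = [c for c in (rm.allocate(m) for m in task_memory_mb) if c]  # step 4
    for container in granted:                                 # steps 5 and 6
        log.append(f"task runs on {container['node']} ({container['memory_mb']} MB)")
    for container in granted + [master]:                      # step 8: give resources back
        rm.release(container)
    rm.registered.discard(app_id)
    log.append(f"{app_id} unregistered")
    return log


rm = ResourceManager({"node1": 2048, "node2": 1024})
events = run_application(rm, "app-1", [1024, 1024])
print("\n".join(events))
```

Step 7 (the client polling for status) is left out; in this sketch the returned event log plays that role.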

Advantages :

 Flexibility: YARN offers flexibility to run various types of distributed
processing systems such as Apache Spark, Apache Flink, Apache Storm,
and others. It allows multiple processing engines to run simultaneously
on a single Hadoop cluster.
 Resource Management: YARN provides an efficient way of managing
resources in the Hadoop cluster. It allows administrators to allocate and
monitor the resources required by each application in a cluster, such as
CPU, memory, and disk space.
 Scalability: YARN is designed to be highly scalable and can handle
thousands of nodes in a cluster. It can scale up or down based on the
requirements of the applications running on the cluster.
 Improved Performance: YARN offers better performance by providing
a centralized resource management system. It ensures that the resources
are optimally utilized, and applications are efficiently scheduled on the
available resources.
 Security: YARN provides robust security features such as Kerberos
authentication, Secure Shell (SSH) access, and secure data transmission.
It ensures that the data stored and processed on the Hadoop cluster is
secure.

Disadvantages :

 Complexity: YARN adds complexity to the Hadoop ecosystem. It
requires additional configurations and settings, which can be difficult for
users who are not familiar with YARN.
 Overhead: YARN introduces additional overhead, which can slow down
the performance of the Hadoop cluster. This overhead is required for
managing resources and scheduling applications.
 Latency: YARN introduces additional latency in the Hadoop ecosystem.
This latency can be caused by resource allocation, application
scheduling, and communication between components.
 Single Point of Failure: YARN can be a single point of failure in the
Hadoop cluster. If the Resource Manager fails, the cluster can no longer
schedule work. To avoid this, administrators need to set up a standby
Resource Manager for high availability.
 Limited Support: YARN has limited support for non-Java programming
languages. Although it supports multiple processing engines, some
engines have limited language support, which can limit the usability of
YARN in certain environments.
MapReduce - Algorithm
The MapReduce algorithm contains two important tasks, namely
Map and Reduce.

 The map task is done by means of the Mapper class.
 The reduce task is done by means of the Reducer class.

The Mapper class takes the input, tokenizes it, maps it, and sorts it. The
output of the Mapper class is used as input by the Reducer class, which in
turn searches for matching pairs and reduces them.

MapReduce implements various mathematical algorithms to
divide a task into small parts and assign them to multiple
systems. In technical terms, the MapReduce algorithm helps in
sending the Map & Reduce tasks to appropriate servers in a
cluster.

These mathematical algorithms may include the following −

 Sorting
 Searching
 Indexing
 TF-IDF
Sorting

Sorting is one of the basic MapReduce algorithms to process and
analyze data. MapReduce implements a sorting algorithm to
automatically sort the output key-value pairs from the mapper by
their keys.

 Sorting methods are implemented in the mapper class itself.
 In the Shuffle and Sort phase, after tokenizing the values in
the mapper class, the Context object (supplied by the framework)
collects the matching valued keys as a collection.
 To collect similar key-value pairs (intermediate keys), the
Mapper class takes the help of the RawComparator class to sort
the key-value pairs.
 The set of intermediate key-value pairs for a given Reducer
is automatically sorted by Hadoop to form key-values (K2,
{V2, V2, …}) before they are presented to the Reducer.
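
The grouping into (K2, {V2, V2, …}) can be reproduced in plain Python. This is a local sketch of the idea, not Hadoop's implementation; the function name shuffle_and_sort and the sample pairs are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(mapper_output):
    """Sort intermediate pairs by key and group the values: (K2, [V2, V2, ...])."""
    ordered = sorted(mapper_output, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

# Hypothetical mapper output: one (word, 1) pair per token.
pairs = [("it", 1), ("is", 1), ("what", 1), ("it", 1), ("is", 1)]
grouped = shuffle_and_sort(pairs)
print(grouped)
```

Each reducer call would then receive one key together with its list of values, for example ("is", [1, 1]).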
Searching

Searching plays an important role in the MapReduce algorithm. It
helps in the combiner phase (optional) and in the Reducer phase.
Let us try to understand how Searching works with the help of an
example.

Example

The following example shows how MapReduce employs the Searching
algorithm to find the details of the employee who draws the
highest salary in a given employee dataset.

 Let us assume we have employee data in four different files
− A, B, C, and D. Let us also assume there are duplicate
employee records in all four files because of importing the
employee data from all database tables repeatedly. See the
following illustration.

 The Map phase processes each input file and provides the
employee data in key-value pairs (<k, v> : <emp name,
salary>). See the following illustration.
 The combiner phase (searching technique) accepts the input
from the Map phase as key-value pairs of employee
name and salary. Using the searching technique, the combiner
checks all the employee salaries to find the highest-salaried
employee in each file. See the following snippet.
<k: employee name, v: salary>
Max = the salary of the first employee (treated as the max salary so far)

if (v(next employee).salary > Max) {
    Max = v(next employee).salary;
}
else {
    continue checking;
}

The expected result is as follows −

<satish, 26000> <gopal, 50000> <kiran, 45000> <manisha, 45000>

 Reducer phase − From each file, you will find the highest-salaried
employee. To avoid redundancy, check all the <k, v> pairs and eliminate
duplicate entries, if any. The same algorithm is applied to the four
<k, v> pairs coming from the four input files. The final output should be
as follows −
<gopal, 50000>
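
The combiner and reducer steps of this example can be sketched in Python. The contents of files A to D are not given in the text, so the records below are hypothetical and were chosen only so that each file's maximum matches the expected combiner result. The function names are likewise made up.

```python
def combiner(records):
    """Per-file step: scan (name, salary) pairs and keep the highest salary."""
    best = records[0]                 # treat the first employee as the max so far
    for record in records[1:]:
        if record[1] > best[1]:
            best = record
    return best

def reducer(candidates):
    """Final step: drop duplicate pairs, then apply the same search again."""
    unique = list(dict.fromkeys(candidates))
    return combiner(unique)

# Hypothetical file contents, including repeated records.
files = {
    "A": [("satish", 26000), ("krishna", 25000), ("satish", 26000)],
    "B": [("gopal", 50000), ("krishna", 25000), ("kiran", 45000)],
    "C": [("kiran", 45000), ("satish", 26000), ("kiran", 45000)],
    "D": [("manisha", 45000), ("krishna", 25000), ("satish", 26000)],
}

per_file = [combiner(rows) for rows in files.values()]
winner = reducer(per_file)
print(per_file)
print(winner)
```

Running the search once per file and once more over the four winners keeps the amount of data passed to the reducer small, which is the purpose of the optional combiner phase.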
Indexing

Normally, indexing is used to point to a particular piece of data and its
address. It performs batch indexing on the input files for a
particular Mapper.

The indexing technique that is normally used in MapReduce is
known as an inverted index. Search engines like Google and Bing use
the inverted indexing technique. Let us try to understand how
Indexing works with the help of a simple example.
Example

The following text is the input for inverted indexing. Here T[0],
T[1], and T[2] are the file names and their contents are in double
quotes.

T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

After applying the Indexing algorithm, we get the following output −

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

Here "a": {2} implies the term "a" appears in the T[2] file.
Similarly, "is": {0, 1, 2} implies the term "is" appears in the files
T[0], T[1], and T[2].
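
A minimal Python sketch of an inverted index over the three example texts follows. It runs in a single process, and the function name is an assumption for illustration; in a MapReduce job the inner loop would correspond to the map step emitting (term, file id) pairs and the set collection to the reduce step.

```python
from collections import defaultdict

def inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):   # "map": emit (term, doc_id)
        for term in text.split():
            index[term].add(doc_id)             # "reduce": collect ids per term
    return {term: sorted(ids) for term, ids in sorted(index.items())}

T = ["it is what it is", "what is it", "it is a banana"]
index = inverted_index(T)
for term, ids in index.items():
    print(f'"{term}": {ids}')
```

Using a set per term means a word that occurs twice in one file, such as "it" in T[0], is still listed only once for that file.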

TF-IDF

TF-IDF is a text processing algorithm; the name is short for Term
Frequency − Inverse Document Frequency. It is one of the
common web analysis algorithms. Here, the term 'frequency'
refers to the number of times a term appears in a document.
Term Frequency (TF)

It measures how frequently a particular term occurs in a
document. It is calculated as the number of times a word appears
in a document divided by the total number of words in that
document.

TF(the) = (Number of times the term ‘the’ appears in a document) / (Total number of terms in the
document)
Inverse Document Frequency (IDF)

It measures the importance of a term. It is calculated as the logarithm of the
number of documents in the text database divided by the number
of documents where a specific term appears.

While computing TF, all the terms are considered equally
important. That means TF also counts the term frequency for common
words like “is”, “a”, “what”, etc. Thus we need to weigh down the
frequent terms while scaling up the rare ones, by computing the
following −

IDF(the) = log_e(Total number of documents / Number of documents with term ‘the’ in it).

The algorithm is explained below with the help of a small example.

Example
Consider a document containing 1000 words, wherein the
word hive appears 50 times. The TF for hive is then (50 / 1000) =
0.05.
Now, assume we have 10 million documents and the
word hive appears in 1000 of these. Then, using a base-10 logarithm, the IDF is
log10(10,000,000 / 1,000) = 4. (With the natural logarithm used in the formula
above, the value would be ln(10,000) ≈ 9.21.)

The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
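
The worked example can be checked with a few lines of Python. The function names are made up for this sketch. The base-10 logarithm is used because that is what reproduces the value 4 in the example; passing math.log instead gives the natural-log variant of the IDF formula.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: occurrences of the term divided by all terms in the document."""
    return term_count / total_terms

def idf(total_docs, docs_with_term, log=math.log10):
    """Inverse document frequency; `log` selects the logarithm base."""
    return log(total_docs / docs_with_term)

tf_hive = tf(50, 1000)               # 50 occurrences in a 1000-word document
idf_hive = idf(10_000_000, 1_000)    # base-10 logarithm, as in the example
weight = tf_hive * idf_hive

print(tf_hive, idf_hive, round(weight, 2))
print(round(idf(10_000_000, 1_000, log=math.log), 2))   # natural-log variant
```

The choice of base only rescales every IDF value by the same constant, so it does not change how terms rank against each other.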
