UPSERT in Hive (3 Step Process)

In this post we'll learn an efficient 3-step process for performing an UPSERT in Hive on a large table that contains the entire history. For readers not familiar with the term, UPSERT is a combination of UPDATE and INSERT. If a table holds history data and we receive new data in which some rows must be inserted and others are updates to existing rows, we need an UPSERT operation to apply both.

Prerequisite: Because the history table is very large, it should be partitioned. Partitioning is also a best practice for efficient storage when holding large volumes of data in any Big Data warehouse.

Business scenario: Let's take the example of clickstream data for a website, gathered from the different browsers of the visitors who visited it. The site_view_hist table contains the click and page impression counts from different browsers, and the table is partitioned on hit_date (the date on which the visitor hit or visited the websi...
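To make the UPSERT semantics described above concrete, here is a minimal in-memory sketch in Python. It is not the Hive process from the post, only an illustration of "update the rows that already exist, insert the ones that don't". The column names `browser`, `clicks` and `impressions`, the choice of `(hit_date, browser)` as the matching key, and the sample values are all assumptions made for the example; only `hit_date` and the idea of click and impression counts come from the text.

```python
def upsert(history, incoming, key_cols):
    """Return a new list of rows in which each incoming row replaces the
    history row with the same key, and rows with unseen keys are appended."""
    # Index the existing rows by their key (dicts keep insertion order).
    merged = {tuple(row[c] for c in key_cols): row for row in history}
    for row in incoming:
        # Same key: overwrite (UPDATE). New key: add (INSERT).
        merged[tuple(row[c] for c in key_cols)] = row
    return list(merged.values())


# Hypothetical sample data; column names other than hit_date are assumed.
history = [
    {"hit_date": "2020-01-01", "browser": "chrome", "clicks": 10, "impressions": 40},
    {"hit_date": "2020-01-01", "browser": "firefox", "clicks": 5, "impressions": 12},
]
incoming = [
    # Matches an existing key, so it acts as an UPDATE.
    {"hit_date": "2020-01-01", "browser": "firefox", "clicks": 7, "impressions": 15},
    # Key not present in history, so it acts as an INSERT.
    {"hit_date": "2020-01-02", "browser": "chrome", "clicks": 3, "impressions": 9},
]

result = upsert(history, incoming, key_cols=("hit_date", "browser"))
for row in result:
    print(row)
```

Here the firefox row for 2020-01-01 is replaced by its newer version and the 2020-01-02 row is added, while the untouched chrome row is carried over unchanged. On a partitioned table, only the partitions (here, `hit_date` values) that appear in the incoming data would need to be rewritten, which is why the prerequisite above matters.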