Datatypes in Hive
Datatypes in Hive
Datatypes in Hive
Every other Object in Hive (say Database, Table, Partition etc.) is a reference to the HDFS. A
Database basically points to a folder in HDFS, a table does the same under the database folder,
partition again references a sub-folder under the table folder and it goes on till the leaf node
which is the file or data.
Now every table [by default] in Hive is an internal table unless specified external explicitly while
creating the table. When you create an internal table it has a strong reference to the inline file
system, which mean if you drop the table in Hive, the data it is referencing will also be deleted.
The external table is the opposite of the internal table. It does have reference to the data but has a
loose coupling with the data. When you drop the table in Hive the data remains intact.
Now the question is when would you use which one ? There are many scenarios. The very basic
is: Lets say you are having a table in Hive which is the Central Dump of all the data and I want
to use it temporarily for some computation. But I belong to some other user-group who doesn’t
have access on the data. Instead of giving access to me directly you can simply ask me to create
an External Table on the same data. I can use the data as I want and when done I will simply
drop my table leaving your data un-touched.
Placing userinfo to Movie table. Hive wont complain at loading as it apply schema on read.
Values of our partitioning columns wont be saved in files. When we fetch the value for partitioned
column, Hive will fetch from the directory name.
Predicate
****If we have both static and dynamic partitions, then static partition should come first.
Dynamic partition column order should be in the same order as when we created the table
In Hive, ORDER BY guarantees total ordering of data, but for that it has to be passed on to a single
reducer, which is normally performance intensive and therefore in strict mode, hive makes it compulsory
to use LIMIT with ORDER BY so that reducer doesn’t get overburdened.
Reducer :
1 emp_id | emp_salary
2 10 5000
3 11 4000
4 17 3100
5 16 3000
6 13 2600
7 14 2500
8 20 2000
9 19 1800
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the
same Distribute By columns will go to the same reducer.
SORT BY
Doesn’t guarantee globally ordered data. Here output of each reducer is sorted.
When we don’t want globally sorted data, or when we have some logic in reducer which needs the
intermediate result to be sorted, we will go for Sorting.
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order
will be dependent on the column types. If the column is of numeric type, then the sort order is also in
numeric order. If the column is of string type, then the sort order will be lexicographical order.
Ordering : It orders data at each of ‘N’ reducers , but each reducer can have overlapping ranges of data.
Outcome : N or more sorted files with overlapping ranges.
Let’s understand with an example of below query:-
1 hive> SELECT emp_id, emp_salary FROM employees SORT BY emp_salary DESC;
Lets assume the number of reducers were set to 2 and output of each reducer is as follows –
Reducer 1 :
1 emp_id | emp_salary
2 10 5000
3 16 3000
4 13 2600
5 19 1800
Example ( taken directly from Hive wiki ):-
It can be created in the name of permanent table. But as long as temporary table is there, it will hide that
permanent table to be used.
Loading data with LOAD DATA Command:
If we load data from two location with same file name into same table without overwrite,