Sqoop Interview Questions
Sqoop Interview Questions
What will happen if target directory already exists during sqoop import?
Ans: Sqoop runs a map-only job and if the target directory is present, it will throw an exception.
We wish to bring data in CSV format in HDFS from RDBMS source. The column in
RDBMS table contains ‘,’. How to distinctly import data in this case?
Ans: Use can use the option –optionally-enclosed-by
Incremental load can be performed by using Sqoop import command or by loading the data into hive
without overwriting it. The different attributes that need to be specified during incremental load in
Sqoop are-
1)Mode (incremental) –The mode defines how Sqoop will determine what the new rows are. The mode
can have value as Append or Last Modified.
2)Col (Check-column) –This attribute specifies the column that should be examined to find out the rows
to be imported.
3)Value (last-value) –This denotes the maximum value of the check column from the previous import
operation.
Large objects in Sqoop are handled by importing the large objects into a file referred as “LobFile” i.e.
Large Object File. The LobFile has the ability to store records of huge size, thus each record in the LobFile
is a large object.
Q What is a disadvantage of using –direct parameter for faster data load by sqoop?
Answer: The native utilities used by databases to support faster load do not work for binary data
formats like SequenceFile
Q How can you check all the tables present in a single database using Sqoop?
Answer: The command to check the list of all tables present in a single database using Sqoop is as
follows-
Q How can you control the number of mappers used by the sqoop command?
Answer: The Parameter –num-mappers is used to control the number of mappers executed by a sqoop
command. We should start with choosing a small number of map tasks and then gradually scale up as
choosing high number of mappers initially may slow down the performance on the database side.
Q How can we import a subset of rows from a table without using the where clause?
Answer: We can run a filtering query on the database and save the result to a temporary table in
database.
Then use the sqoop import command without using the –where clause
Q When the source data keeps getting updated frequently, what is the approach to keep it
in sync with the data in HDFS imported by sqoop?
Answer: sqoop can have 2 approaches.
a − To use the –incremental parameter with append option where value of some columns are checked
and only in case of modified values the row is imported as a new row.
b − To use the –incremental parameter with lastmodified option where a date column in the source is
checked for records which have been updated after the last import.
Q Can free form SQL queries be used with Sqoop import command? If yes, then how can
they be used?
Answer: Sqoop allows us to use free form SQL queries with the import command. The import command
should be used with the –e and – query options to execute free form SQL queries. When using the –e
and –query options with the import command the –target dir value must be specified.
–Columns
–Where
Q Can free form SQL queries be used with Sqoop import command? If yes, then how can
they be used?
Answer: Sqoop allows us to use free form SQL queries with the import command. The import command
should be used with the –e and – query options to execute free form SQL queries. When using the –e
and –query options with the import command the –target dir value must be specified.
Q How can you see the list of stored jobs in sqoop metastore?
Answer: sqoop job –list
Q What is the default extension of the files produced from a sqoop import using the –
compress parameter?
Answer: .gz
Q What are the limitations of importing RDBMS tables into Hcatalog directly?
Answer: There is an option to import RDBMS tables into Hcatalog directly by making use of –hcatalog –
database option with the –hcatalog –table but the limitation to it is that there are several arguments like
–as-avro file , -direct, -as-sequencefile, -target-dir , -export-dir are not supported.
For example the –connect parameter’s value and –user name value scan be stored in a file and used
again and again with different sqoop commands.
Q What are the two file formats supported by sqoop for import?
Answer: Delimited text and Sequence Files.
Q while loading table from MySQL into HDFS, if we need to copy tables with maximum
possible speed, what can you do?
Answer: We need to use -direct argument in import command to use direct import fast path and this -
direct can be used only with MySQL and PostGreSQL as of now.
Q How can you sync a exported table with HDFS data in which some rows are deleted?
Answer: Truncate the target table and load it again.
Q How do you clear the data in a staging table before loading it by Sqoop?
Answer: By specifying the –clear-staging-table option we can clear the staging table before it is loaded.
This can be done again and again till we get proper data in staging.