IBM InfoSphere Data Replication 11.3.3.1 Change Data Capture (CDC) WebHDFS
Change Data Capture (CDC) WebHDFS
[Architecture diagram: the CDC target engine connects to the Hadoop cluster over TCP/IP, using HTTPS as the transport]
WebHDFS support utilizes REST APIs
– The CDC installation is outside of the Hadoop cluster, which provides the following benefits:
• When a cluster node fails, CDC will not be affected
• Changes/upgrades of the Hadoop cluster will not impact the server where the CDC target engine is running
– Allows CDC to target any Hadoop distribution
WebHDFS Support Overview
Note: The CDC WebHDFS target engine is packaged with the CDC DataStage target engine. DataStage is not used as part of the solution.
WebHDFS Create CDC Instance
The first step after installing the CDC DataStage target engine is to create the WebHDFS instance.
Note that the fully qualified connection string must be supplied, including the /webhdfs/v1/ path.
Additional examples by Hadoop service for WebHDFS with default configuration (a quick curl connectivity check follows this list):
– Through HttpFS proxy : BigInsights 3.0
• http://<HOSTNAME>:14000/webhdfs/v1/
• https://<HOSTNAME>:14443/webhdfs/v1/
– Through Knox gateway : BigInsights 4.0
• https://<HOSTNAME>:8443/gateway/default/webhdfs/v1/
– Directly to HDFS NameNode : rarely permitted in production
• http://<HOSTNAME>:50070/webhdfs/v1/
• https://<HOSTNAME>:50470/webhdfs/v1/
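Before creating the instance, the chosen endpoint can be sanity-checked with curl. This is only a hedged sketch: the host names, credentials, and the use of GETFILESTATUS (a harmless read-only operation) are illustrative placeholders, not part of the CDC configuration itself.
curl -i "http://<HOSTNAME>:14000/webhdfs/v1/?op=GETFILESTATUS&user.name=hdfs"                        # HttpFS proxy without security
curl -iku <USER>:<PASSWORD> "https://<HOSTNAME>:8443/gateway/default/webhdfs/v1/?op=GETFILESTATUS"   # through Knox with basic authentication
A successful call returns an HTTP 200 with a FileStatus JSON document, confirming the connection string is usable before it is typed into the instance configuration.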
Configuring Hadoop Properties for the Subscription…
The following illustrates the configuration to utilize Kerberos authentication:
Note that the fully qualified connection string must be supplied, including the /webhdfs/v1/ path.
Principal:
– Specify the Default Principal (which can be displayed using klist)
Keytab Path:
– Specify the fully qualified path, including the keytab file name
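A minimal command-line sketch for collecting these two values; the keytab path and principal shown here are placeholders, not values supplied with CDC:
klist                                            # shows the default principal of the current ticket cache
klist -kt /path/to/cdcuser.keytab                # lists the principals stored in the keytab file (placeholder path)
kinit -kt /path/to/cdcuser.keytab <PRINCIPAL>    # confirms the keytab can actually obtain a ticket for that principal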
– (_)[Table].D[Date].[Time][# Records]
• _ = Currently open HDFS file. Removed when completed
• [Date] = Julian date (year, day number within year)
• [Time] = hh24mmss when flat file was created (in GMT)
• [# Records] = Optionally the number of records can be added
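Purely as an illustration of this convention (table name and timestamp invented for the example): while the file is still being written it might be named _CUSTOMER.D2015288.153022, and once it is closed the leading underscore is removed, leaving CUSTOMER.D2015288.153022; if the record-count option is enabled, the count is appended to the name.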
For those who are familiar with standard IIDR flat file production, there are some behavior differences between IIDR HDFS files and standard flat file production:
– File prefix is different
• HDFS uses _ instead of @ for the working file
– Fields are not quoted in files produced in HDFS
– HDFS doesn't create a [Table].STOPPED file when the subscription is stopped
Single record
– In this format an update operation is sent as a single row
– The before and after images are contained in the same record
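Purely as a conceptual illustration, not the actual IIDR column layout: a single-record update can be pictured as one delimited row that carries the operation code followed by the before-image columns and then the after-image columns, for example U,<before col1>,<before col2>,<after col1>,<after col2>.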
Example connection string through the Knox gateway: https://HOSTNAME:PORT/gateway/default/webhdfs/v1/
Format
– hdfs dfs -<command> for example hdfs dfs -ls /
– hadoop fs -<command> for example hadoop fs -ls /
ls, cat, chmod, and chown are frequently used during initial testing and basic troubleshooting, as sketched below.
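A short sketch of how those four commands might be used to check that the CDC user can see and manage the target directory; the path, file, and user names are placeholders:
hdfs dfs -ls /user/cdc/landing                      # confirm the target directory exists and list delivered files
hdfs dfs -cat /user/cdc/landing/<FILE>              # spot-check the contents of a delivered file
hdfs dfs -chmod 755 /user/cdc/landing               # adjust permissions if writes are being rejected
hdfs dfs -chown cdcuser:hadoop /user/cdc/landing    # make sure the CDC user owns the target directory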
For more detailed usage of curl with WebHDFS, refer to the Hadoop documentation. The following URL is for Hadoop 2.7.1:
– https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
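The responses shown below were produced by calls of roughly the following shape. This is a hedged sketch assuming direct, unauthenticated access to the NameNode on port 50070 (which is why the created directory ends up owned by dr.who) and an illustrative path /user/hdfs/curltest:
curl -i "http://<NAMENODE>:50070/webhdfs/v1/user/hdfs/curltest?op=GETFILESTATUS"          # returns the FileStatus JSON
curl -i -X PUT "http://<NAMENODE>:50070/webhdfs/v1/user/hdfs/curltest/direct?op=MKDIRS"   # returns {"boolean":true}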
{"FileStatus":{"accessTime":0,"blockSize":0,"childrenNum":0,"fileId":23630,"group":"hdfs","length":0,"modificationTime":1444792788155,"owner":
"hdfs","pathSuffix":"","permission":"777","replication":0,"storagePolicy":0,"type":"DIRECTORY"}}
{"boolean":true}
[hdfs@bigdata ~]$ hadoop fs -ls /user/hdfs/curltest
Found 1 items
drwxr-xr-x - dr.who hdfs 0 2015-10-14 12:26 /user/hdfs/curltest/direct
/home/khjang$ klist
Ticket cache: FILE:/tmp/krb5cc_2636
Default principal: biadmin/cdclnxy.canlab.ibm.com@CANLAB.IBM.COM
Get file status by using Kerberos authentication that was initialized by kinit
/home/khjang$ curl -i --negotiate -u : http://cdclnxy.canlab.ibm.com:14000/webhdfs/v1/user/?op=GETFILESTATUS
HTTP/1.1 401
WWW-Authenticate: Negotiate
Set-Cookie: hadoop.auth=;Path=/;Expires=Thu, 01-Jan-1970 00:00:00 GMT
Content-Type: text/html; charset=iso-8859-1
Cache-Control: must-revalidate,no-cache,no-store
Content-Length: 1363
Server: Jetty(6.1.x)
HTTP/1.1 200 OK
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: hadoop.auth="u=biadmin&p=biadmin/cdclnxy.canlab.ibm.com@CANLAB.IBM.COM&t=kerberos-
dt&e=1446051015925&s=q75UMQvMt9AWnJFalzVdj8/94+E=";Path=/
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.x)
{"FileStatus":{"pathSuffix":"","type":"DIRECTORY","length":0,"owner":"hdfs","group":"biadmin","permission":"777","accessTime":0,"modificationT
ime":1446014900992,"blockSize":0,"replication":0}}
HTTP/1.1 200 OK
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: hadoop.auth="u=biadmin&p=biadmin/cdclnxy.canlab.ibm.com@CANLAB.IBM.COM&t=kerberos-
dt&e=1446050900934&s=xNPXl2Qv1ss7Aj2Zviepfsi4rjc=";Path=/
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.x)
{"boolean":true}
HTTP/1.1 200 OK
Expires: Thu, 01-Jan-1970 00:00:00 GMT
Set-Cookie: hadoop.auth="u=biadmin&p=biadmin/cdclnxy.canlab.ibm.com@CANLAB.IBM.COM&t=kerberos-
dt&e=1446114369078&s=lvgSUa2eGMtKsqlTYEEhKETVOi8=";Path=/
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.x)
{"FileStatus":{"pathSuffix":"","type":"DIRECTORY","length":0,"owner":"biadmin","group":"biadmin","permission":"755","accessTime":0,"modificati
onTime":1446078023426,"blockSize":0,"replication":0}}
{"FileStatus":{"accessTime":0,"blockSize":0,"childrenNum":0,"fileId":23630,"group":"hdfs","length":0,"modificationTime":1444792788155,"owner":
"hdfs","pathSuffix":"","permission":"777","replication":0,"storagePolicy":0,"type":"DIRECTORY"}}
{"boolean":true}
[hdfs@bigdata ~]$ hadoop fs -ls /user/hdfs/curltest
Found 2 items
drwxr-xr-x - dr.who hdfs 0 2015-10-14 12:26 /user/hdfs/curltest/direct
drwxr-xr-x - guest hdfs 0 2015-10-14 12:28 /user/hdfs/curltest/knox
When replication fails with a Kerberos error, first check the time gap between the Hadoop cluster and the IIDR CDC server and fix it
– Kerberos rejects authentication if the clock skew is more than 5 minutes
– If the time gap is not the cause, validate the keytab and principal again
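A quick, hedged way to check for skew (host and NTP server names are placeholders): compare UTC time on both machines and, if they differ, point them at the same NTP source:
date -u                          # current UTC time on the IIDR CDC server
ssh <HADOOP_NODE> date -u        # same check on a cluster node; compare the two values
ntpdate -q <NTP_SERVER>          # query the offset against a common NTP server without changing the clock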
CDC Redbook:
– http://www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg247941.html?Open
Passport Advantage:
– https://www-112.ibm.com/software/howtobuy/softwareandservices/passportadvantage