04 - IBM Watsonx - Data - Apache Spark

IBM watsonx.data
Hands-on Lab - Apache Spark
Lab Guide by:
Danny Arnold
Principal, Learning Content Development | Data & AI
darnold@us.ibm.com
Presenter:
Farah Auni Hisham
Technical Enablement Specialist | Data & AI
farah.hisham@ibm.com
Ecosystem Technical Enablement | Data & AI
watsonx.data
Apache Spark
Part 1 Brief Introduction

Part 2 Using Apache Spark within a Jupyter Notebook
Part 1 Brief Introduction
1.1 Apache Spark
1.2 Jupyter Notebook
1.1 Apache Spark
Apache Spark is an open-source data-processing engine for large data sets. It is designed
to deliver the computational speed, scalability, and programmability required for big data, specifically for streaming data,
graph data, ML, and AI applications.
[Figure: basic Apache Spark architecture diagram]
Spark has libraries that extend its capabilities to ML, AI, and stream processing.
• Apache Spark MLlib: provides out-of-the-box machine learning algorithms, such as classification, regression, and clustering.
• Spark Streaming: an extension of the core Spark application programming interfaces (APIs) that enables scalable, fault-tolerant processing of live data streams.
• Spark SQL: a distributed query engine that provides low-latency, interactive queries up to 100x faster than MapReduce.
• Spark GraphX: a library designed to solve graph problems.
1.2 Jupyter Notebook
Jupyter notebooks are based on IPython, whose notebook development started in the 2006/7 timeframe. The existing Python interpreter was limited in functionality, and work was started to create a richer development environment. By 2011 the development efforts resulted in the IPython notebook being released (http://blog.fperez.org/2012/01/ipython-notebook-historical.html).
Jupyter notebooks were a spinoff (2014) from the original IPython project. IPython continues to be the Python kernel that Jupyter runs on, but the notebooks are now a project of their own.
Jupyter notebooks run in a browser and communicate with the backend IPython server, which renders this content. These notebooks are used extensively by data scientists and anyone who wants to document, plot, and execute their code in an interactive environment. The beauty of Jupyter notebooks is that you document what you do as you go along.

Part 2 Using Apache Spark within a Jupyter Notebook
2.1 Environment Setup
2.2 Basics of Jupyter Notebook
2.3 Using Spark within watsonx.data
Public Service Announcement:
As this is a guided exercise, please follow the exact naming conventions used here so that the workshop runs smoothly.
Using different folder names, catalog names, etc. might affect the subsequent steps.
Troubleshooting or amending a query will sometimes take more time than restarting the hands-on lab steps.
Download the PDF from the Box folder for a seamless experience.

2.1 Environment Setup
1. Go to the TechZone reservation page https://techzone.ibm.com/my/reservations and click on your reserved tile. Alternatively, go to https://techzone.ibm.com/my/workshops/student/661c02131de802001e001249.
Wait until the status shows Ready.
2. Refer to the ‘Published services’ section to access the SSH command, Presto console, MinIO console, and watsonx.data UI.
3. Command & URL*:
SSH command:
• ssh -p <5 digits> watsonx@<server>.techzone-services.com
Jupyter Notebook – Server:
• http://<server>services.cloud.techzone.ibm.com:<5 digits>/notebooks/Table_of_Contents.ipynb
*Use your own commands and URLs from the ‘Published services’ section on your instance page.
• No active reservation? Follow the steps in Appendix 1 Lab Reservation.
• Copy your list of Published services somewhere, such as Notepad, in case you get logged out of TechZone!
4. Open a command prompt, run the SSH command, and enter the password:
SSH command*:
Are you sure you want to continue connecting?
• yes
Password [Password input will be invisible]
• watsonx.data
5. Switch to the root user and change directory to the watsonx.data product binaries:
Switch to root user
• sudo su -
Change directory
• cd /root/ibm-lh-dev/bin
6. Check the container status.
Check status of the container
• ./status.sh --all
7. Access the Jupyter Notebook – Server URL.
Open a web browser (Google Chrome and Mozilla Firefox have been tested and work with this lab).
Jupyter Notebook – Server*:
• http://<server>services.cloud.techzone.ibm.com:<5 digits>/notebooks/Table_of_Contents.ipynb
8. In the SSH session, run the command below:
Get server list and password/token:
• jupyter server list
9. Copy the token from the server list and paste it into the password/token field.
Click Log in. The watsonx.data Sample Notebooks page should open.
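If copying the token by eye is error-prone, steps 8 and 9 can be sketched in a few lines of Python. This is only an illustration: it assumes the usual "URL :: directory" output layout of jupyter server list, and the host, port, token, and directory in sample_output are made-up values, not ones from the lab image. The helper name extract_tokens is hypothetical.

```python
from typing import List
from urllib.parse import parse_qs, urlparse


def extract_tokens(server_list_output: str) -> List[str]:
    """Return the token of every server URL found in the given text.

    Assumes each server is printed on its own line as
    '<url>?token=<token> :: <notebook directory>'.
    """
    tokens = []
    for line in server_list_output.splitlines():
        line = line.strip()
        if not line.startswith("http"):
            continue  # skip the header line and blank lines
        url = line.split(" :: ", 1)[0]
        query = parse_qs(urlparse(url).query)
        tokens.extend(query.get("token", []))
    return tokens


# Hypothetical output; every value below is invented for illustration.
sample_output = """Currently running servers:
http://localhost:8888/?token=0123456789abcdef :: /notebooks
"""

print(extract_tokens(sample_output))
```

The value printed by the sketch is what would be pasted into the password/token field in step 9.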
2.2 Basics of Jupyter Notebook
1. Click the blue arrow icon → in the “Introduction to Jupyter Notebooks” tile.
2. Go to the “A Quick Tour” section and read through the various parts of the Jupyter Notebook.
2.3 Using Spark within watsonx.data
1. Go back to the watsonx.data Sample Notebooks.
2. Scroll down to the section “Accessing watsonx.data with Python, Pandas, and Apache Spark”.
3. Click the blue arrow icon → in the “Accessing watsonx.data with Spark” tile.
4. The Jupyter Notebook for this lab, which contains the Spark code, is now displayed in the browser window. The second section, watsonx.data Development Systems Updates, details all the items that were changed to allow Apache Spark to run in the watsonx.data Development Lab image. Since this is the watsonx.data image used for this lab exercise, these changes are already in effect, and there is nothing you need to do to run Spark in this notebook.
Data foundation with Medallion Architecture
Better Data Quality, Governance: Raw Data → Bronze → Silver → Gold
Bronze (Ingestion Tables)
• No business rules or transformations of any kind.
• Should be easy to get new data onto this layer.
• Not so high performance.
• Interoperable format.
• Flexible schema.
Silver (Refined Tables)
• Prioritize speed to market and write performance.
• Just enough (low volume) transactions.
• Quality data expected.
• Medium performance.
Gold (Refined++ Tables)
• Prioritize business use cases and user experience.
• Precalculated, business-specific transformations/aggregations.
• Well defined schema.
• Highest performance.
• High volume transactions.
5. Run each cell and read through the notes. You will notice that the last step removes all tables and schemas and deletes the buckets created earlier.
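The Bronze, Silver, and Gold layers of the medallion architecture can be illustrated with a toy example. The sketch below uses plain Python lists and dictionaries instead of Spark tables so that it runs anywhere; the records, field names, and the functions to_silver and to_gold are invented for illustration and are not taken from the lab notebook.

```python
from collections import defaultdict

# Bronze: raw records kept exactly as ingested (all strings, no rules applied).
bronze = [
    {"order_id": "1", "region": "east", "amount": "10.50"},
    {"order_id": "2", "region": "West ", "amount": "4.25"},
    {"order_id": "2", "region": "West ", "amount": "4.25"},  # duplicate row
    {"order_id": "3", "region": "east", "amount": "n/a"},    # unparseable amount
]


def to_silver(rows):
    """Clean and type the raw rows: drop bad amounts and duplicate order ids."""
    seen_ids = set()
    silver_rows = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # quality data expected in Silver, so skip bad rows
        order_id = int(row["order_id"])
        if order_id in seen_ids:
            continue
        seen_ids.add(order_id)
        silver_rows.append(
            {
                "order_id": order_id,
                "region": row["region"].strip().lower(),
                "amount": amount,
            }
        )
    return silver_rows


def to_gold(rows):
    """Precalculate a business-specific aggregate: total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In the lab itself the same progression would be expressed with Spark DataFrames and watsonx.data tables; the point here is only how each layer adds rules on top of the previous one.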
Appendix 1 Lab Reservation
The first step before starting the watsonx hands-on exercise is to reserve your lab environment.
For this workshop, there are 2 methods:
1) Reserve your own lab environment via IBM Technology Zone, commonly known as TechZone.
• Reserving your own lab via TechZone is the recommended option because you can extend your reservation up to 2 times.
• You can also access the environment for at least 2 days after your reservation start time.
• To reserve your own environment, follow the steps under “watsonx.data Lab Reservation” in this appendix.
• Learning how to reserve your own environment in TechZone is essential for demos or even PoX.
• If you already have an active reservation of the watsonx.data image in TechZone, you may skip this part and proceed to 2.1 Environment Setup.
2) Access the pre-reserved lab environment (only available for the duration of the workshop).
• This lab is only available for the duration of the workshop, and the instance will be deleted afterwards.
• If your own lab reservation failed, or you only want to access the lab during the workshop, proceed to Appendix 2 Access Pre-reserved Workshop.
watsonx.data Lab Reservation
1. You can reserve your own TechZone lab for self-practice or client demo purposes.
2. Go to the watsonx.data developer base image in TechZone using the link below:
• https://techzone.ibm.com/collection/ibm-watsonxdata-developer-base-image/environments
3. Go to the Environment tab.
4. Select IBM watsonx.data Development Lab (please do not select the POC version for the purpose of this lab!).
5. Select Reserve.
6. Select Reserve for now and fill in the reservation form.
• Purpose: Practice / Self-Education
• Purpose description: Practice watsonx lab
• Preferred Geography: itz-watsonx – AMERICAS…
• VPN Access: Disable
7. Tick the box to agree with the IBM TechZone T&C and policies, and click Submit.

8. When your reservation is ready, you will receive an email notification. Click View My Reservation.
9. Click the reservation tile to see the published services.
10. Published services will have the details of your environment.
Appendix 2 Access Pre-reserved Workshop
1. Access the IBM watsonx.data Workshop via this link:
• https://techzone.ibm.com/my/workshops/student/661c02131de802001e001249
2. Log in using your IBM ID.
• An IBM ID is a prerequisite for accessing TechZone.
3. Enter the password below and click “Submit password/ access code”.
• Password: watsonx
Appendix 3 Restart container – Only to troubleshoot
1. Open a command prompt, run the SSH command, and enter the password:
SSH command* (use your own command from the ‘Published services’ section):
2. Switch to the root user and change directory to the watsonx.data product binaries:
Switch to root user
• sudo su -
Change directory
• cd /root/ibm-lh-dev/bin
3. Stop watsonx.data and wait until all components have been stopped:
Stop container:
• ./stop.sh
4. Start watsonx.data by running the following two commands:
Run mode:
• export LH_RUN_MODE=diag
Start container:
• ./start.sh
5. It will take a few minutes for the various component containers to start. Check the status of watsonx.data:
Check container status:
• ./status.sh --all
Appendix 4 Removing the Db2 container’s password 90-day limit
If a ‘password expired’ error appears on a Db2 connection, it is because the Db2 container has a 90-day limit on the password.
Use the following method to remove the limit.
1. Open a command prompt, run the SSH command, and enter the password:
SSH command* (use your own command from the ‘Published services’ section):
2. Switch to the root user and change directory to the watsonx.data product binaries:
Switch to root user
• sudo su -
Change directory
• cd /root/ibm-lh-dev/bin
3. Change the password expiry settings of the Db2 server:
Run Docker command:
• docker exec db2server chage -I -1 -m 0 -M 99999 -E -1 db2inst1
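For readers who want to know what each chage option in step 3 does, the sketch below assembles the same command as an argument list with one comment per flag. It only builds and prints the command; the function name build_chage_command is hypothetical, and the flag descriptions follow the standard Linux chage options rather than anything specific to the lab image.

```python
import shlex


def build_chage_command(container="db2server", user="db2inst1"):
    """Build the docker exec command that lifts the password age limits."""
    chage_args = [
        "chage",
        "-I", "-1",     # do not lock the account after the password expires
        "-m", "0",      # no minimum number of days between password changes
        "-M", "99999",  # maximum password age in days (effectively unlimited)
        "-E", "-1",     # no account expiration date
        user,
    ]
    return ["docker", "exec", container] + chage_args


command = build_chage_command()
print(shlex.join(command))
```

On the lab VM, the list could be passed to subprocess.run(command, check=True) as the root user instead of typing the command by hand; that call is not made here because it needs the running Db2 container.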
