04 - IBM Watsonx - Data - Apache Spark


IBM watsonx.data
Hands-on Lab - Apache Spark
Lab Guide by:
Danny Arnold
Principal, Learning Content Development | Data & AI
darnold@us.ibm.com

Presenter:
Farah Auni Hisham
Technical Enablement Specialist | Data & AI
farah.hisham@ibm.com

Ecosystem Technical Enablement | Data & AI
watsonx.data
Apache Spark

Part 1 Brief Introduction


Part 2 Using Apache Spark within a Jupyter Notebook

Part 1
Brief Introduction

watsonx.data
Apache Spark

Part 1 Brief Introduction


1.1 Apache Spark
1.2 Jupyter Notebook

Part 2 Using Apache Spark within a Jupyter Notebook

1.1 Apache Spark

Apache Spark is an open-source data-processing engine for large data sets. It is designed
to deliver the computational speed, scalability, and programmability required for big data, specifically for streaming data,
graph data, machine learning (ML), and artificial intelligence (AI) applications.
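
To give a feel for the processing model, here is a minimal sketch in plain Python (not Spark itself) of the split, map, and reduce pattern that an engine like Spark distributes across a cluster. The function names and sample lines are hypothetical and chosen only for illustration; in the lab, Spark does this work for you inside the notebook.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal the records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """'Map' side: count words using only the lines in one partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """'Reduce' side: combine two partial results into one."""
    return left + right


# Hypothetical sample data standing in for a large data set.
lines = [
    "spark processes large data sets",
    "spark sql runs interactive queries",
    "spark streaming handles live data streams",
]

partitions = split_into_partitions(lines, num_partitions=2)
partial_counts = [count_words(p) for p in partitions]  # independent per partition
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts["spark"], word_counts["data"])
```

Each partition is counted independently, which is what lets a real cluster run that step in parallel on different machines before the partial results are merged.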

[Figure: the basic Apache Spark architecture diagram]


Spark has libraries that extend its capabilities to ML, AI, and stream processing:

• Apache Spark MLlib: an out-of-the-box machine learning library, covering tasks such as classification, regression, and clustering.
• Spark Streaming: an extension of the core Spark application programming interfaces (APIs) that enables scalable, fault-tolerant processing of live data streams.
• Spark SQL: a distributed query engine that provides low-latency, interactive queries up to 100x faster than MapReduce.
• Spark GraphX: a library designed to solve graph problems.

1.2 Jupyter Notebook

Jupyter notebooks are based on the IPython notebook, whose development started in the 2006/7 timeframe. The existing Python
interpreter was limited in functionality, and work was started to create a richer development environment. By 2011, these
development efforts resulted in the IPython notebook being released
(http://blog.fperez.org/2012/01/ipython-notebook-historical.html).

Jupyter notebooks were a spinoff (2014) from the original IPython project. IPython continues to be the kernel that
Jupyter runs on, but the notebooks are now a project on their own.

Jupyter notebooks run in a browser and communicate with the backend IPython server, which renders this content. These
notebooks are used extensively by data scientists and anyone wanting to document, plot, and execute their code in an
interactive environment. The beauty of Jupyter notebooks is that you document what you do as you go along.



Part 2
Using Apache Spark within a Jupyter Notebook

watsonx.data
Apache Spark

Part 1 Brief Introduction

Part 2 Using Apache Spark within a Jupyter Notebook


2.1 Environment Setup
2.2 Basics of Jupyter Notebook
2.3 Using Spark within watsonx.data

Public Service Announcement:

As this is a guided exercise, please follow the exact naming conventions used here so that the workshop runs smoothly.

Using different folder names, catalog names, and so on might affect the subsequent steps.

Troubleshooting or amending a query will sometimes take more time than restarting the hands-on lab steps.

Download the PDF from the Box folder for a seamless experience.

2.1 Environment Setup

1. Go to the TechZone reservation page https://techzone.ibm.com/my/reservations and click on your reserved tile.
Alternatively, go to https://techzone.ibm.com/my/workshops/student/661c02131de802001e001249.
Wait until the Status shows Ready.

2. Refer to the ‘Published services’ section of your instance page to get the SSH command and the URLs for the Presto
console, MinIO console, and watsonx.data UI.

3. Command & URL*:

SSH command:
• ssh -p <5 digits> watsonx@<server>.techzone-services.com
Jupyter Notebook – Server:
• http://<server>services.cloud.techzone.ibm.com:<5 digits>/notebooks/Table_of_Contents.ipynb

• *Use your own commands and URLs from the ‘Published services’ section in your instance page.
• No active reservation? Follow the steps in Appendix 1 Lab Reservation.
• Copy your list of Published services somewhere like Notepad in case you get logged out of TechZone!
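
As an optional aid, the sketch below shows one way to keep your Published services values in one place and rebuild the SSH command and notebook URL from them. It is an illustration only: the class name, field names, and every host and port value are hypothetical placeholders, and you should copy the real hosts and ports verbatim from your own instance page.

```python
from dataclasses import dataclass


@dataclass
class PublishedServices:
    """Values copied by hand from the 'Published services' section."""
    ssh_host: str        # full host name from the SSH entry
    ssh_port: int        # the 5-digit SSH port
    notebook_host: str   # full host name from the Jupyter Notebook entry
    notebook_port: int   # the 5-digit notebook port

    def ssh_command(self) -> str:
        # The lab always logs in as the 'watsonx' user.
        return f"ssh -p {self.ssh_port} watsonx@{self.ssh_host}"

    def notebook_url(self) -> str:
        return (
            f"http://{self.notebook_host}:{self.notebook_port}"
            "/notebooks/Table_of_Contents.ipynb"
        )


# Placeholder values only; replace them with your own.
services = PublishedServices(
    ssh_host="myserver.techzone-services.com",
    ssh_port=12345,
    notebook_host="notebook-host.example.com",
    notebook_port=23456,
)

print(services.ssh_command())
print(services.notebook_url())
```

Saving the printed lines in a text file serves the same purpose as the Notepad suggestion above.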

4. Open a command prompt and run the SSH command, then enter the password:

SSH command*:
• ssh -p <5 digits> watsonx@<server>.techzone-services.com
Are you sure you want to continue connecting?
• yes
Password [Password input will be invisible]
• watsonx.data

5. Switch to the root user and change directory to the watsonx.data product binaries:

Switch to root user
• sudo su -
Change directory
• cd /root/ibm-lh-dev/bin

• *Use your own commands and URLs from the ‘Published services’ section in your instance page.


6. Check the container status.

Check status of the container
• ./status.sh --all

7. Access the Jupyter Notebook – Server URL.

Open a web browser (Google Chrome and Mozilla Firefox have been tested and work with this lab).

Jupyter Notebook – Server*:
• http://<server>services.cloud.techzone.ibm.com:<5 digits>/notebooks/Table_of_Contents.ipynb

• *Use your own commands and URLs from the ‘Published services’ section in your instance page.


8. Run the command below in your SSH session:

Get server list and password/token:
• jupyter server list

9. Copy the token from the server list and paste it into the password/token field.
Click Log in. It should open the watsonx.data Sample Notebooks.
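
If you prefer not to pick the token out by eye, the following sketch pulls it from the text printed by jupyter server list. It assumes each running server is listed as a URL with a token query parameter followed by " :: " and a directory, which is how recent Jupyter versions print it. The sample output and the function name are made up for illustration, so check them against what your own session shows.

```python
from typing import List
from urllib.parse import parse_qs, urlparse


def extract_tokens(server_list_output: str) -> List[str]:
    """Return the token of every server URL found in the command output."""
    tokens = []
    for line in server_list_output.splitlines():
        line = line.strip()
        if not line.startswith("http"):
            continue  # skip the heading line and any blank lines
        url = line.split(" :: ")[0]  # drop the trailing notebook directory
        query = parse_qs(urlparse(url).query)
        tokens.extend(query.get("token", []))
    return tokens


# Hypothetical output, pasted from the SSH session for illustration.
sample_output = """Currently running servers:
http://localhost:8888/?token=0123456789abcdef :: /notebooks
"""

print(extract_tokens(sample_output))
```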

2.2 Basics of Jupyter Notebook

1. Click on the blue arrow icon → in the “Introduction to Jupyter Notebooks” tile.


2. Go to the “A Quick Tour” section and read through the various parts of the Jupyter Notebook.

2.3 Using Spark within watsonx.data

1. Go back to the watsonx.data Sample Notebooks.
2. Scroll down to the section “Accessing watsonx.data with Python, Pandas, and Apache Spark”.
3. Click on the blue arrow icon → in the “Accessing watsonx.data with Spark” tile.


4. The Jupyter Notebook for this lab, which contains the Spark code, is now displayed in the browser window. The second section,
watsonx.data Development Systems Updates, details all the items that were changed to allow Apache Spark to run in the
watsonx.data Development Lab image. Since this is the watsonx.data image used for this lab exercise, these
changes are already in effect, and there is nothing you need to do to run Spark in this notebook.


Data foundation with Medallion Architecture

Raw Data → Bronze → Silver → Gold (better data quality and governance at each step)

Bronze: Ingestion Tables
• No business rules or transformations of any kind.
• Should be easy to get new data onto this layer.
• Not so high performance.

Silver: Refined Tables
• Prioritize speed to market and write performance.
• Just enough (low volume) transactions.
• Quality data expected.
• Interoperable format.
• Flexible schema.
• Medium performance.

Gold: Refined++ Tables
• Prioritize business use cases and user experience.
• Precalculated, business-specific transformations/aggregations.
• Well defined schema.
• Highest performance.
• High volume transactions.
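
The flow between the three layers can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code in the lab notebook (which uses Spark): the sample orders, field names, and the two helper functions are hypothetical.

```python
from collections import defaultdict

# Bronze: raw records exactly as ingested, with no rules applied
# (hypothetical sample data; every value is still a string).
bronze = [
    {"order_id": "1", "region": "emea", "amount": "120.50"},
    {"order_id": "2", "region": "AMER", "amount": "80.00"},
    {"order_id": "3", "region": "emea", "amount": "not-a-number"},
    {"order_id": "2", "region": "AMER", "amount": "80.00"},  # duplicate row
]


def to_silver(rows):
    """Refine: fix types, normalize values, drop bad and duplicate rows."""
    seen_ids = set()
    silver_rows = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # discard rows whose amount cannot be parsed
        if row["order_id"] in seen_ids:
            continue  # discard duplicates
        seen_ids.add(row["order_id"])
        silver_rows.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].upper(),
            "amount": amount,
        })
    return silver_rows


def to_gold(rows):
    """Precalculate a business-specific aggregate: total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Bronze keeps everything, silver keeps only trustworthy rows with a consistent shape, and gold stores the answer a business user actually asks for.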

5. Run each cell and read through the notes. You will notice that the last step removes all tables and schemas and deletes
the buckets created earlier in the notebook.

Appendix 1 Lab Reservation

The first step before starting the watsonx hands-on exercise is reserving your lab environment.
For this workshop, there are 2 methods:

1) Reserve your own lab environment via IBM Technology Zone, commonly known as TechZone.
• Reserving your own lab via TechZone is the recommended option, as it means that you may extend
your reservation up to 2 times.
• You are also able to access the environment for at least 2 days after your reservation start time.
• To reserve your own environment, follow the steps under watsonx.data Lab Reservation in this appendix.
• Learning how to reserve your own environment in TechZone is essential for demos or even PoX.
• If you already have an active reservation of the watsonx.data image in TechZone, you may skip this part and proceed to
2.1 Environment Setup.

2) Access the pre-reserved lab environment (only available throughout the workshop duration).
• This lab is only available for the duration of the workshop, and the instance will be deleted afterwards.
• If your own lab reservation failed or you would like to access the lab only during the workshop, proceed to
Appendix 2 Access Pre-reserved Workshop.

watsonx.data Lab Reservation

1. You can reserve your own TechZone lab for self-practice or client demo purposes.
2. Go to the watsonx.data developer base image in TechZone from the link below:
• https://techzone.ibm.com/collection/ibm-watsonxdata-developer-base-image/environments
3. Go to the Environment tab.
4. Select IBM watsonx.data Development Lab (please do not select the POC version for the purpose of this lab!).
5. Select Reserve.
6. Choose to reserve for now and fill in the reservation form.
• Purpose: Practice / Self-Education
• Purpose description: Practice watsonx lab
• Preferred Geography: itz-watsonx – AMERICAS…
• VPN Access: Disable
7. Tick the box to agree with the IBM TechZone T&C and policies, and click Submit.


8. When your reservation is ready, you will receive an email notification. Click View My Reservation.

9. Click on the reservation tile to see the published services.

10. Published services will have the details of your environment.

Appendix 2 Access Pre-reserved Workshop

1. Access the IBM watsonx.data Workshop via this link:
• https://techzone.ibm.com/my/workshops/student/661c02131de802001e001249

2. Log in using your IBM ID.
• An IBM ID is a prerequisite to access TechZone.

3. Enter the password below and click “Submit password/ access code”.
• Password: watsonx

Appendix 3 Restart container – Only to troubleshoot

1. Open a command prompt and run the SSH command, then enter the password:

SSH command*:
• ssh -p <5 digits> watsonx@<server>.techzone-services.com

2. Switch to the root user and change directory to the watsonx.data product binaries:

Switch to root user
• sudo su -
Change directory
• cd /root/ibm-lh-dev/bin

3. Stop watsonx.data and wait until all components have been stopped:

Stop container:
• ./stop

• *Use your own commands and URLs from the ‘Published services’ section in your instance page.


4. Start watsonx.data by running the following two commands:

Run mode:
• export LH_RUN_MODE=diag
Start container:
• ./start

5. It will take a few minutes for the various component containers to start. Check the status of watsonx.data:

Check container status:
• ./status --all
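
For reference, the restart sequence from steps 3 to 5 can be written down as a small script. The sketch below is a hypothetical convenience wrapper, not part of the lab image. It uses only the commands shown in this appendix, defaults to a dry run that prints the plan, and would need to be run as root on the lab server with dry_run=False to actually restart anything.

```python
import os
import subprocess

BIN_DIR = "/root/ibm-lh-dev/bin"  # directory used throughout this guide

# Each step is (command, extra environment variables for that command).
RESTART_STEPS = [
    (["./stop"], {}),
    (["./start"], {"LH_RUN_MODE": "diag"}),  # same effect as the export line
    (["./status", "--all"], {}),
]


def restart(dry_run=True):
    """Run (or just print) the stop, start, and status commands in order."""
    executed = []
    for command, extra_env in RESTART_STEPS:
        env_prefix = " ".join(f"{key}={value}" for key, value in extra_env.items())
        description = f"{env_prefix} {' '.join(command)}".strip()
        executed.append(description)
        if dry_run:
            print(f"[dry run] {description}")
        else:
            # check=True stops the sequence if a command fails.
            subprocess.run(
                command,
                cwd=BIN_DIR,
                env={**os.environ, **extra_env},
                check=True,
            )
    return executed


plan = restart(dry_run=True)
```

Because the components take a few minutes to come up, the final status command may need to be repeated by hand until everything reports as running.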

Appendix 4 Removing the Db2 container’s password 90-day limit

If a ‘password expired’ error appears in a Db2 connection, it is because the Db2 container has a 90-day limit on the password.
Use the following method to remove the limit.

1. Open a command prompt and run the SSH command, then enter the password:

SSH command*:
• ssh -p <5 digits> watsonx@<server>.techzone-services.com

2. Switch to the root user and change directory to the watsonx.data product binaries:

Switch to root user
• sudo su -
Change directory
• cd /root/ibm-lh-dev/bin

3. Remove the password age limit for the db2inst1 user in the Db2 server container:

Run Docker command:
• docker exec db2server chage -I -1 -m 0 -M 99999 -E -1 db2inst1

Here -I -1 disables the inactivity lock, -m 0 and -M 99999 set the minimum and maximum password age in days, and -E -1
removes the account expiration date.
