04 - IBM Watsonx - Data - Apache Spark
watsonx.data Hands-on Lab - Apache Spark
Lab Guide by:
Danny Arnold
Principal, Learning Content Development | Data & AI
darnold@us.ibm.com
Presenter:
Farah Auni Hisham
Technical Enablement Specialist | Data & AI
farah.hisham@ibm.com
Ecosystem Technical Enablement | Data & AI
watsonx.data
Apache Spark
Part 1
Brief Introduction
1.1 Apache Spark
Apache Spark is an open-source data-processing engine for large data sets. It is designed
to deliver the computational speed, scalability, and programmability required for big data, specifically for streaming data,
graph data, ML, and AI applications.
Spark has libraries that extend its capabilities to ML, AI, and stream processing.
• Apache Spark MLlib: an out-of-the-box machine learning library for tasks such as classification, regression, and clustering.
• Spark Streaming: an extension of the core Spark application programming interfaces (APIs) that enables scalable, fault-tolerant processing of live data streams.
• Spark SQL: a distributed query engine that provides low-latency, interactive queries up to 100x faster than MapReduce.
• Spark GraphX: an API for graphs and graph-parallel computation, designed to solve graph problems.
1.2 Jupyter Notebook
Jupyter notebooks are based on IPython, an enhanced interactive shell for Python. Work on a notebook interface for IPython
started in the 2006/7 timeframe: the existing Python interpreter was limited in functionality, and the goal was a richer
development environment. By 2011 the development efforts resulted in the IPython Notebook being released
(http://blog.fperez.org/2012/01/ipython-notebook-historical.html).
Jupyter notebooks were a spinoff (2014) from the original IPython project. IPython continues to be the kernel that
runs Python code in Jupyter, but the notebooks are now a project of their own.
Jupyter notebooks run in a browser and communicate with the backend IPython server, which executes the code and returns
the results for display. These notebooks are used extensively by data scientists and anyone wanting to document, plot, and
execute their code in an interactive environment. The beauty of Jupyter notebooks is that you document what you do as you
go along.
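As a rough illustration of how documentation and code sit side by side, the sketch below builds a tiny notebook document by hand. It assumes the standard nbformat 4 JSON layout (a `cells` list holding `markdown` and `code` cells); the cell contents are made up for illustration and are not taken from the lab notebook.

```python
import json

# A minimal notebook in the nbformat 4 layout: one Markdown cell that
# documents the step, followed by one code cell that performs it.
# The cell contents are hypothetical examples.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["## Step 1\n", "Add two numbers."],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["1 + 1"],
        },
    ],
}

# A .ipynb file is this structure serialized as JSON.
text = json.dumps(notebook, indent=1)

# Read it back and list the cell types in order.
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```

In practice you never write this JSON by hand; the browser interface creates and edits the cells for you.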
Public Service Announcement:
Using different folder names, catalog names, etc. might affect the subsequent steps.
Troubleshooting or amending a query will sometimes take more time than restarting the hands-on lab steps.
2.1 Environment Setup
2. Refer to the ‘Published Services’ section of your reservation to access the SSH command, Presto console, MinIO console, and
watsonx.data UI.
SSH command*:
• ssh -p <5 digits> watsonx@<server>.techzone-services.com
Are you sure you want to continue connecting?
• Yes
Password [Password input will be invisible]
• watsonx.data
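The SSH command has a fixed shape, so it can be assembled from the two values that differ per reservation. The sketch below is only an illustration: the function name `build_ssh_command` and the sample port and server values are hypothetical, and the real values come from the ‘Published Services’ section of your own reservation.

```python
import shlex


def build_ssh_command(port: int, server: str, user: str = "watsonx") -> str:
    """Return the ssh command line for a reservation (hypothetical helper)."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    # The host suffix follows the command template shown in this lab guide.
    host = f"{server}.techzone-services.com"
    # shlex.join quotes each argument so the result is safe to paste into a shell.
    return shlex.join(["ssh", "-p", str(port), f"{user}@{host}"])


# Placeholder values; substitute the port and server from your reservation.
print(build_ssh_command(12345, "example-server"))
```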
9. Copy the token from the server list and paste it into the password/token field.
Click Log in. It should open the watsonx.data Sample Notebooks.
2.2 Basics of Jupyter Notebook
2. Go to the “A Quick Tour” section and read through the various parts of Jupyter Notebook.
2.3 Using Spark within watsonx.data
4. The Jupyter Notebook for this lab, which contains the Spark code, is now displayed in the browser window. The second section,
watsonx.data Development Systems Updates, details all the items that were changed to allow Apache Spark to run in the
watsonx.data Development Lab image. Since this is the watsonx.data image being used for this lab exercise, these
changes are already in effect, and there is nothing you need to do to run Spark in this notebook.
Figure: Raw Data, Bronze, Silver, Gold
5. Run each cell and read through the notes. You will notice that the last step removes all tables and schemas and deletes
the buckets created earlier.
Appendix 1 Lab Reservation
The first step before commencing the watsonx hands-on exercise is reserving your lab environment.
For this workshop, there are 2 methods:
1) Reserving your own lab environment via IBM Technology Zone, commonly known as Techzone.
• Reserving your own lab via Techzone is the recommended option, as it means that you may extend
your reservation up to 2 times.
• You are also able to access the environment at least 2 days after your reservation start time.
• You can proceed to reserve your own by following the watsonx.data Lab Reservation steps in this appendix.
• Learning how to reserve your own environment in Techzone is essential for demos or even PoX.
• If you already have an active reservation of the watsonx.data image in Techzone, you may skip this part and proceed to
2.1 Environment Setup.
watsonx.data Lab Reservation
link below:
• https://techzone.ibm.com/collection/ibm-watsonxdata-developer-base-image/environments
3. Go to the Environment tab.
4. Select IBM watsonx.data Development Lab (please do not select the POC version for the purpose of this lab!)
5. Select Reserve.
6. Reserve for now and fill in reservation form.
• Purpose: Practice / Self-Education
• Purpose description: Practice watsonx lab
• Preferred Geography: itz-watsonx – AMERICAS…
• VPN Access: Disable
7. Tick to agree with IBM Techzone T&C and policies and click
Submit.
8. When your reservation is ready, you will receive an email notification. Click View My Reservation.
9. Click on the reservation tile to see the published services.
Appendix 2 Access Pre-reserved Workshop
Appendix 3 Restart container – Only to troubleshoot
Appendix 4 Removing the Db2 container’s password 90-day limit
If a ‘password expired’ error appears on a Db2 connection, it is because the Db2 container has a 90-day limit on the password.
Use the following method to remove the limit.