Connect Hadoop Database by Using Hive in Python
posted Oct 11, 2014, 4:43 AM by Ting Yu [ updated Oct 22, 2014, 2:47 AM ]
On the Hadoop platform, there are two scripting languages that simplify the code: Pig is a data-flow language of its own, while Hive's query language closely resembles SQL. Using Hive is quite easy. It ships with a large set of built-in functions (and supports user-defined functions, or UDFs) for transforming data, such as regular-expression tools. A developer can add UDFs by writing them in Java. Another way to add procedural logic that complements Hive's set-based, SQL-like language is to use a language like Python.
In this example, we use a Python module to access a database table. Hive retrieves the data, partitions it, and sends the rows to Python processes running on the different cluster nodes.
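The row-streaming pattern described above is what Hive's TRANSFORM clause does: each row is piped to a script as a tab-separated line on stdin, and whatever the script writes to stdout becomes the output rows. As a minimal sketch (the masking logic and column layout here are hypothetical, not from the original post):

```python
#!/usr/bin/env python
# Hypothetical streaming script for Hive's TRANSFORM clause.
# Hive sends each row as a tab-separated line on stdin; each line
# printed to stdout becomes an output row.
import sys


def mask_card(line):
    """Mask all but the last four digits of the first column."""
    fields = line.rstrip('\n').split('\t')
    card = fields[0]
    fields[0] = '*' * (len(card) - 4) + card[-4:]
    return '\t'.join(fields)


if __name__ == '__main__':
    for row in sys.stdin:
        print(mask_card(row))
```

In Hive, such a script would be attached and invoked with something like `ADD FILE mask.py; SELECT TRANSFORM(col1, col2) USING 'python mask.py' AS (col1, col2) FROM some_table;` (column and table names are placeholders).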
In addition to a standard Python installation, a few libraries need to be installed to allow Python to connect to the Hadoop database.
1. Pyhs2, Python Hive Server 2 Client Driver: https://pypi.python.org/pypi/pyhs2/0.5.0
2. Sasl, Cyrus-SASL bindings for Python: https://pypi.python.org/pypi/sasl/0.1.3
3. Thrift, Python bindings for the Apache Thrift RPC system: https://pypi.python.org/pypi/thrift/0.9.1
4. PyHive, Python interface to Hive: https://pypi.python.org/pypi/PyHive/0.1.0
All the libraries are installed in the folder ~/site-packages. The installation commands are below:
unzip pyhs2-master.zip
cd pyhs2-master
python setup.py install --user
tar zxvf sasl-0.1.3.tar.gz
cd sasl-0.1.3
python setup.py install --user
tar zxvf thrift-0.9.1.tar.gz
cd thrift-0.9.1
python setup.py install --user
tar zxvf PyHive-0.1.0.tar.gz
cd PyHive-0.1.0
python setup.py install --user
The main Python code to connect to the database:
#!/usr/bin/env python
import pyhs2 as hive
import getpass
DEFAULT_DB = 'default'
DEFAULT_SERVER = '10.37.40.1'
DEFAULT_PORT = 10000
DEFAULT_DOMAIN = 'PAM01-PRD01.IBM.COM'
# Get the username and password
u = raw_input('Enter PAM username: ')
s = getpass.getpass()
# Build the Hive Connection
connection = hive.connect(host=DEFAULT_SERVER, port=DEFAULT_PORT, authMechanism='LDAP',
                          user=u + '@' + DEFAULT_DOMAIN, password=s,
                          database=DEFAULT_DB)
# Hive query statement
statement = "select * from user_yuti.Temp_CredCard where pir_post_dt = '2014-05-01' limit 100"
cur = connection.cursor()
# Run the Hive query and fetch the result as a list of lists
cur.execute(statement)
df = cur.fetchall()
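The pyhs2 cursor also exposes the result schema via `getSchema()`, a list of dicts that includes each column's `columnName`. Pairing that schema with the rows from `fetchall()` gives labeled records. The helper below is a sketch shown with hypothetical sample data (no live connection is needed to illustrate it):

```python
def rows_to_dicts(schema, rows):
    """Pair each row (a list of values) with the column names
    taken from cur.getSchema(), yielding one dict per row."""
    names = [col['columnName'] for col in schema]
    return [dict(zip(names, row)) for row in rows]


# Hypothetical values standing in for cur.getSchema() and
# cur.fetchall() results:
schema = [{'columnName': 'card_no'}, {'columnName': 'pir_post_dt'}]
rows = [['1234', '2014-05-01'], ['5678', '2014-05-01']]
records = rows_to_dicts(schema, rows)
```

With a live cursor this would be called as `rows_to_dicts(cur.getSchema(), cur.fetchall())`.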
Remember to make the script executable before running it:
chmod +x test_hive2.py
./test_hive2.py