The first thing we need to do is download the latest version of Sqoop from http://www.apache.org/dist/sqoop/ and extract it on our machine. From here on, we will refer to the Sqoop installation directory as $SQOOP_HOME.
Given here are the prerequisites for the Sqoop import process; a quick environment check is sketched after the following list.
An installed and running relational database management system (MySQL).
An installed and running Hadoop cluster.
The $HADOOP_HOME environment variable set.
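Before running any import, it is worth confirming that these prerequisites are in place. The following is a minimal sketch only: the installation paths and the MySQL connector JAR name are placeholders for whatever your environment actually uses.
# Example environment variables (paths are placeholders, adjust to your installation)
$ export HADOOP_HOME=/opt/hadoop
$ export SQOOP_HOME=/opt/sqoop
$ export PATH=$PATH:$SQOOP_HOME/bin
# Copy the MySQL JDBC driver into Sqoop's lib directory so that Sqoop can talk to MySQL
$ cp mysql-connector-java-5.1.x-bin.jar $SQOOP_HOME/lib/
# Print the Sqoop version and list the available tools to verify the setup
$ sqoop version
$ sqoop help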
The following are the common arguments of the import process.
Parameters | Description
--connect <jdbc-uri> | Specifies the server or database to connect to, including the port. Example: --connect jdbc:mysql://host:port/databaseName
--connection-manager <class-name> | Specifies the connection manager class to use.
--driver <class-name> | Specifies the fully qualified name of the JDBC driver class.
--password <password> | Sets the authentication password required to connect to the input source.
--username <username> | Sets the authentication username.
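Before launching a full import, these common arguments can be exercised on their own with Sqoop's list-tables tool to verify connectivity. This is only a sketch; the host, port, database name, and credentials are the same placeholder values used throughout this article, and com.mysql.jdbc.Driver is the standard MySQL JDBC driver class.
# Check the connection settings by listing the tables in the database
$ bin/sqoop list-tables --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password
# The driver class is normally inferred from the JDBC URL, but it can be set explicitly
$ bin/sqoop list-tables --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password --driver com.mysql.jdbc.Driver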
Let's see how to work with the import process.
First, we will start by importing a single RDBMS table into Hadoop.
Query 1:
$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password --table tableName \
  --target-dir /user/abc/tableName
The content of the output file in HDFS will look like the sample shown in the explanation later in this section.
Next, we will shed some light on how to import only selected rows and selected columns of an RDBMS table into Hadoop.
Query 2:
$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password --table student \
  --target-dir /user/abc/student --columns "student_id,address,name"
Query 3:
$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password --table student \
  --target-dir /user/abc/student --where 'student_id<100'
Query 4:
$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password --table student \
  --target-dir /user/abc/student --columns "student_id,address,name" \
  --where 'student_id<100'
Now let's see how the above queries work:
Apart from the common arguments of the import process explained previously, this part covers some additional arguments that are required to import a table into the Hadoop Distributed File System (HDFS).
Parameters | Description
--table <table-name> | Name of the input table to fetch.
--target-dir <dir> | Location of the output/target directory in HDFS.
--direct | Use a non-JDBC, database-specific access mechanism for faster data transfer, where available.
--options-file <file-path> | Command-line options that are common to most commands can be placed in an options file for convenience.
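As an illustration of --options-file, the connection settings that repeat in every query can be moved into a plain text file. The file name, location, and target directory below are hypothetical; the format (one option or value per line, with # starting a comment) is Sqoop's standard options-file format.
# Contents of /home/abc/import-options.txt
import
--connect
jdbc:mysql://localhost:3306/db1
--username
root
--password
password
The command-specific arguments are then supplied on the command line:
$ bin/sqoop --options-file /home/abc/import-options.txt \
  --table student --target-dir /user/abc/student_opts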
Query 1 runs a MapReduce job and imports all the rows of the given table into HDFS (here, /user/abc/tableName is the location of the output files). The records imported into HDFS preserve their original column order, which means that if the input table contains four columns A, B, C, and D, the content of the HDFS file will look like:
A1, B1, C1, D1
A2, B2, C2, D2
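To confirm what Query 1 produced, the imported files can be inspected with the Hadoop shell. This is a sketch; the path matches the --target-dir used in Query 1, and the part-m-* pattern matches the default names of map-task output files, whose exact count depends on the number of map tasks.
# List the files created by the import job
$ bin/hadoop fs -ls /user/abc/tableName
# Print the imported records, one comma-separated line per row
$ bin/hadoop fs -cat /user/abc/tableName/part-m-*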
By default, the import query selects all columns of the input table, but we can select a subset of columns by specifying a comma-separated list of columns in the --columns argument.
Query 2 fetches only three columns (student_id, address, and name) of the student table. If the import query contains the --columns argument, the order of the columns in the output files is the same as the order specified in --columns. The output in HDFS will look like:
student_id, address, name
1, Delhi, XYZ
2, Mumbai, PQR
..........
If the import query lists the columns in the order "address, name, student_id", then the output in HDFS will look like:
address, name, student_id
Delhi, XYZ, 1
Mumbai, PQR, 2
.............
By default, all the rows of the input table are imported into HDFS, but we can control which rows are imported by using the --where argument in the import statement.
Query 3 imports into HDFS only those rows whose student_id value is less than 100.
Query 4 uses both the --columns and --where arguments in one statement. For Query 4, Sqoop internally generates a query of the form select student_id, address, name from student where student_id<100.
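The same result can also be expressed with Sqoop's free-form --query argument instead of --table, --columns, and --where. The sketch below is not one of the original queries; note that a free-form query must contain the literal $CONDITIONS token and needs either a --split-by column or a single mapper, and the target directory here is hypothetical.
# Equivalent import written as a free-form query
$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password \
  --query 'SELECT student_id, address, name FROM student WHERE student_id<100 AND $CONDITIONS' \
  --split-by student_id \
  --target-dir /user/abc/student_query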
This section covers some more examples of the import process.
So far we have imported a single table into HDFS; this section introduces the import-all-tables tool, with which we can import a set of tables from an RDBMS into HDFS. The import-all-tables tool creates a separate directory in HDFS for each RDBMS table. The following are the mandatory conditions for the import-all-tables tool:
Each table must have a single-column primary key.
You must intend to import all the columns of each table.
The --where, --columns, and --query arguments are not permitted.
$ bin/sqoop import-all-tables --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password
This query will import all tables (tableName and tableName1) of database db1 into HDFS.
Sqoop creates one output directory in HDFS per imported table; in this example, one directory for tableName and another for tableName1.
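If a few tables should be skipped, or if the per-table directories should be grouped under a single parent path, import-all-tables also accepts the --exclude-tables and --warehouse-dir arguments. The following is a sketch with hypothetical table and directory names.
# Import every table in db1 except tableName1, creating one subdirectory per table under /user/abc/warehouse
$ bin/sqoop import-all-tables --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password \
  --exclude-tables tableName1 \
  --warehouse-dir /user/abc/warehouse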
We learned a lot in this article: how to import a single RDBMS table into HDFS, how to import only selected columns and selected rows, and how to import a set of RDBMS tables.