In this article by Hanish Bansal, Saurabh Chauhan, and Shrey Mehrotra, the authors of the book Apache Hive Cookbook, we will cover the following recipes:
- Securing Hadoop
- Authorizing Hive
Security is a major concern in all big data frameworks. Implementing security in distributed systems is somewhat complex because components running on different machines need to communicate with each other. That makes it all the more important to secure access to the data.
In today's era of big data, most organizations are concentrating on using Hadoop as a centralized data store. Data size is growing day by day, and organizations want to derive insights and make decisions using this information. While everyone is focusing on collecting data, having all of it in a centralized place increases the risk to data security. Securing data access in the Hadoop Distributed File System (HDFS) is therefore very important. Hadoop security means restricting data access to only authorized users and groups. Furthermore, when we talk about security, there are two major aspects: authentication and authorization.
HDFS supports a permission model for files and directories that is largely equivalent to the standard POSIX model. Similar to UNIX permissions, each file and directory in HDFS is associated with an owner, a group, and other users. There are three types of permissions in HDFS: read, write, and execute.
In contrast to the UNIX permission model, there is no concept of executable files in HDFS. For files, read (r) permission is required to read a file, and write (w) permission is required to write or append to a file. For directories, read (r) permission is required to list the contents of the directory, write (w) permission is required to create or delete files or subdirectories, and execute (x) permission is required to access the child objects (files/subdirectories) of that directory. The default level of access granted to each entity, namely OWNER, GROUP, and OTHER, is described next.
The Default HDFS Permission Model
By default, the permission set for the owner of a file or directory is rwx (7), which means the owner has full permission to read, write, and execute. For members of the group, the permission set is r-x (5), which means group members can only read and execute the files or directories and cannot write or update anything in them. For other users, the permission set is the same as for the group: they can only read and execute the files or directories and cannot write or update anything in them.
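As a quick illustration, the standard HDFS shell can be used to inspect and change these basic permissions. The /data path, the owner hduser, and the group analysts used below are only example names, and the listed output is only indicative:
$ hadoop fs -ls /
drwxr-xr-x   - hduser supergroup          0 2016-01-10 11:20 /data
$ hadoop fs -chmod 750 /data          # owner: rwx, group: r-x, others: no access
$ hadoop fs -chown hduser:analysts /data          # change the owner and the group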
Although this basic permission model is sufficient to handle a large number of security requirements at the file and directory level, it cannot be used to define finer-grained security for specifically named users or groups. HDFS also has a feature to configure Access Control Lists (ACLs), which can be used to define fine-grained permissions at the file level as well as the directory level for specifically named users or groups. For example, if you want to give read access to the users John, Mike, and Kate, HDFS ACLs can be used to define such permissions.
HDFS ACLs are modeled on the POSIX ACLs of UNIX systems.
First of all, you will need to enable ACLs in Hadoop. To enable ACL permissions, configure the following property in the Hadoop configuration file hdfs-site.xml, located at <HADOOP_HOME>/etc/hadoop/hdfs-site.xml:
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>
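Note that a change to dfs.namenode.acls.enabled takes effect only after the NameNode is restarted. On a simple single-node setup, the scripts shipped with Hadoop can be used for this; the exact commands may differ in your distribution:
$ <HADOOP_HOME>/sbin/stop-dfs.sh
$ <HADOOP_HOME>/sbin/start-dfs.sh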
There are two main commands that are used to configure ACLs: setfacl and getfacl. The setfacl command is used to set the ACLs of files or directories, and getfacl is used to retrieve the ACLs of files or directories.
Let's see how to use these commands:
hdfs dfs -setfacl [-R] [-b |-k -m |-x <acl_specification> <path>] |[--set <acl_specification> <path>]
The same command can also be run using hadoop fs, as follows:
hadoop fs -setfacl [-R] [-b |-k -m |-x <acl_specification> <path>] |[--set <acl_specification> <path>]
This command contains the following elements:
-R: Apply the operation recursively to all files and subdirectories under a directory.
-b: Remove all ACL entries except the base (owner, group, and other) entries.
-k: Remove the default ACL (applicable to directories only).
-m: Modify the ACL by adding new entries; existing entries are retained.
-x: Remove only the specified ACL entries; other entries are retained.
--set: Replace the ACL completely; the acl_specification must include entries for user, group, and other.
<acl_specification>: A comma-separated list of ACL entries, for example user:mike:rw-.
<path>: The file or directory on which to set the ACL.
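The following are a few typical invocations of setfacl; the paths and the user and group names (john, kate, and analysts) are only examples:
$ hadoop fs -setfacl -m user:john:r-- /stock-data          # add or modify an entry for the user john
$ hadoop fs -setfacl -m group:analysts:rwx /data           # add an entry for a named group
$ hadoop fs -setfacl -x user:john /stock-data              # remove only the entry for the user john
$ hadoop fs -setfacl -b /stock-data                        # remove all entries except the base entries
$ hadoop fs -setfacl -R -m user:kate:r-x /data             # apply the change recursively to a directory tree
$ hadoop fs -setfacl --set user::rw-,user:john:rw-,group::r--,other::r-- /stock-data          # replace the ACL completely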
Now, let's see another command that is used to retrieve the ACLs:
hdfs dfs -getfacl [-R] <path>
This command can also be run using hadoop fs as follows:
hadoop fs -getfacl [-R] <path>
This command contains the following elements:
-R: List the ACLs of all files and directories recursively.
<path>: The file or directory whose ACLs are to be listed.
The getfacl command will list the base ACL entries as well as any additional ACL entries defined for the specified files or directories.
If ACLs are defined for a file or directory, then while accessing that file or directory, access is validated as follows:
1. If the user is the owner of the file, the owner permissions are used.
2. Otherwise, if the user matches one of the named user entries, the permissions of that entry, filtered by the mask, are used.
3. Otherwise, if the user is a member of the owning group or of any named group, access is granted if any of the matching group entries, filtered by the mask, grants the requested access.
4. Otherwise, the permissions of the other entry are used.
Let's assume that we have a file named stock-data containing stock market data. To retrieve all ACLs of this file, run the following command:
$ hadoop fs -getfacl /stock-data
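A typical output looks like the following; the owner hduser, the group supergroup, the file size, and the timestamp are only example values and will differ in your environment:
# file: /stock-data
# owner: hduser
# group: supergroup
user::rw-
group::r--
other::r--
The basic permissions of the same file can also be listed with the ls command:
$ hadoop fs -ls /stock-data
-rw-r--r--   1 hduser supergroup       1024 2016-01-10 11:20 /stock-data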
Because we have not defined any custom ACL for this file, the command returns only the base ACL entries, as shown in the sample output above.
You can also check the permissions of a file or directory using the ls command. In this example, the permission set of the stock-data file is -rw-r--r--, which means read and write access for the owner and read access for group members and others.
In the following command, we give read and write access to the user named mike:
$ hadoop fs -setfacl -m user:mike:rw- /stock-data
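We can then run getfacl again to verify the change. A typical output looks like the following; as before, the owner and group names and the file details are only example values:
$ hadoop fs -getfacl /stock-data
# file: /stock-data
# owner: hduser
# group: supergroup
user::rw-
user:mike:rw-
group::r--
mask::rw-
other::r--
$ hadoop fs -ls /stock-data
-rw-rw-r--+  1 hduser supergroup       1024 2016-01-10 11:20 /stock-data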
First, we defined the ACL for the user mike using the setfacl command; then, we retrieved the ACLs using the getfacl command.
The output of the getfacl command lists the base permissions as well as all ACL entries. Because we defined an ACL for the user mike, the output contains an extra row, user:mike:rw-.
There is also an extra row in the output, mask::rw-, which is the special mask ACL entry. The mask is a special ACL entry that limits the effective permissions of all named users, named groups, and the owning group. If you have not defined the mask ACL explicitly, its value is calculated as the union of all those permissions.
In addition, the output of the ls command also changes after ACLs are defined: an extra plus (+) sign at the end of the permissions string indicates that additional ACL entries are defined for this file or directory.
Now let's revoke the access of the user mike. To remove a specific ACL entry, the -x option is used with the setfacl command:
$ hadoop fs -setfacl -x user:mike /stock-data
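Running getfacl once more verifies that the change took effect; a typical output, again with example owner and group names, looks like this:
$ hadoop fs -getfacl /stock-data
# file: /stock-data
# owner: hduser
# group: supergroup
user::rw-
group::r--
mask::r--
other::r--
Note that the mask entry is recalculated and remains until all extended entries are removed, which can be done with the -b option.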
As the sample output above shows, after revoking the access of the user mike, the ACLs are updated and there is no longer an entry for the user mike.
You can read more about the permission model in Hadoop at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html.
Hive authorization is about verifying that a user is authorized to perform a particular action. Authentication, in contrast, is about verifying the identity of a user, which is a different concept from authorization.
Hive can be used in the following different ways:
- Through the Hive CLI, which accesses the metastore and the data in HDFS directly
- Through JDBC/ODBC clients such as Beeline, which connect via HiveServer2
- Through the metastore service and the HCatalog API (for example, from Pig or MapReduce jobs)
The appropriate authorization mechanism depends on how Hive is accessed.
The following are the various ways of performing authorization in Hive:
- Default (legacy) authorization: the mode available in older versions of Hive; it is not secure because it does not prevent malicious users from bypassing the checks.
- Storage-based authorization: authorization is delegated to the underlying storage layer (HDFS permissions and ACLs) and is enforced in the metastore server.
- SQL standards-based authorization: fine-grained authorization based on the SQL standard GRANT and REVOKE statements, enforced in HiveServer2.
To enable storage-based authorization in the metastore server, set the following properties:
| Property | Value |
| --- | --- |
| hive.metastore.pre.event.listeners | org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener |
| hive.security.metastore.authorization.manager | org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider |
| hive.security.metastore.authenticator.manager | org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator |
| hive.security.metastore.authorization.auth.reads | true |
After setting all these configurations, the Hive configuration file hive-site.xml will look as follows:
<configuration>
  <property>
    <name>hive.metastore.pre.event.listeners</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
  </property>
  <property>
    <name>hive.security.metastore.authorization.manager</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
  </property>
  <property>
    <name>hive.security.metastore.authenticator.manager</name>
    <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
  </property>
  <property>
    <name>hive.security.metastore.authorization.auth.reads</name>
    <value>true</value>
  </property>
</configuration>
Many more fine-grained permissions can be managed with SQL standards-based authorization. Refer to the SQL standards-based authorization documentation for more details.
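As a brief illustration, once SQL standards-based authorization is enabled in HiveServer2, privileges are managed with standard GRANT and REVOKE statements. The connection URL and the role, user, and table names below are only examples:
$ beeline -u jdbc:hive2://localhost:10000 -n admin
CREATE ROLE analyst;
GRANT SELECT ON TABLE stock_data TO ROLE analyst;
GRANT ROLE analyst TO USER mike;
SHOW GRANT ROLE analyst ON TABLE stock_data;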
In this article, we covered two recipes: Securing Hadoop and Authorizing Hive. You also learned the terminology of access permissions and their types, went through the steps to secure Hadoop, and learned the different ways to perform authorization in Hive.