In this article by Hanish Bansal, Saurabh Chauhan, and Shrey Mehrotra, the authors of the book Apache Hive Cookbook, we will cover the following recipes:
- Securing Hadoop
- Authorizing Hive
Security is a major concern in all big data frameworks. Implementing security in distributed systems is somewhat complex because components running on different machines need to communicate with each other. That makes it all the more important to secure access to the data.
In today's era of big data, most organizations are concentrating on using Hadoop as a centralized data store. Data size is growing day by day, and organizations want to derive insights and make decisions using this information. While everyone is focusing on collecting data, having all of it in a centralized place increases the risk to data security. Securing data access in the Hadoop Distributed File System (HDFS) is therefore very important. Hadoop security means restricting data access to only authorized users and groups. Furthermore, when we talk about security, there are two major aspects: authentication and authorization.
HDFS supports a permission model for files and directories that is largely equivalent to the standard POSIX model. Similar to UNIX permissions, each file and directory in HDFS is associated with an owner, a group, and other users. There are three types of permissions in HDFS: read, write, and execute.
In contrast to the UNIX permission model, there is no concept of executable files in HDFS. For files, read (r) permission is required to read a file, and write (w) permission is required to write or append to a file. For directories, read (r) permission is required to list the contents of the directory, write (w) permission is required to create or delete files or subdirectories, and execute (x) permission is required to access the child objects (files/subdirectories) of that directory. The default level of access granted to each entity, namely OWNER, GROUP, and OTHER, is described next.
The Default HDFS Permission Model
By default, the permission set for the owner of a file or directory is rwx (7), which means the owner has full permission to read, write, and execute. For members of the group, the permission set is r-x (5), which means group members can only read and execute the files or directories and cannot write or update anything in them. For other users, the permission set is the same as for the group: they can only read and execute the files or directories and cannot write or update anything in them.
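As a quick illustration, the standard HDFS shell can be used to inspect and change these basic permissions. The /data path, the owner hduser, and the group analysts used below are only example names, and the listed output is only indicative:
$ hadoop fs -ls /
drwxr-xr-x   - hduser supergroup          0 2016-01-10 11:20 /data
$ hadoop fs -chmod 750 /data          # owner: rwx, group: r-x, others: no access
$ hadoop fs -chown hduser:analysts /data          # change the owner and the group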
Although this basic permission model is sufficient to handle a large number of security requirements at the file and directory level, it cannot be used to define finer-grained security for specifically named users or groups. HDFS also has a feature to configure Access Control Lists (ACLs), which can be used to define fine-grained permissions at the file level as well as the directory level for specifically named users or groups. For example, if you want to give read access to the users John, Mike, and Kate, HDFS ACLs can be used to define such permissions.
HDFS ACLs are modeled on the POSIX ACLs of UNIX systems.
First of all, you will need to enable ACLs in Hadoop. To enable ACL permissions, configure the following property in the Hadoop configuration file hdfs-site.xml, located at <HADOOP_HOME>/etc/hadoop/hdfs-site.xml:
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>
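Note that a change to dfs.namenode.acls.enabled takes effect only after the NameNode is restarted. On a simple single-node setup, the scripts shipped with Hadoop can be used for this; the exact commands may differ in your distribution:
$ <HADOOP_HOME>/sbin/stop-dfs.sh
$ <HADOOP_HOME>/sbin/start-dfs.sh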
There are two main commands that are used to configure ACLs: setfacl and getfacl. The setfacl command is used to set the ACLs of files or directories, and getfacl is used to retrieve the ACLs of files or directories.
Let's see how to use these commands:
hdfs dfs -setfacl [-R] [-b |-k -m |-x <acl_specification> <path>] |[--set <acl_specification> <path>]
The same command can also be run using hadoop fs, as follows:
hadoop fs -setfacl [-R] [-b |-k -m |-x <acl_specification> <path>] |[--set <acl_specification> <path>]
This command contains the following elements:
-R: Apply the operation recursively to all files and subdirectories under a directory.
-b: Remove all ACL entries except the base (owner, group, and other) entries.
-k: Remove the default ACL (applicable to directories only).
-m: Modify the ACL by adding new entries; existing entries are retained.
-x: Remove only the specified ACL entries; other entries are retained.
--set: Replace the ACL completely; the acl_specification must include entries for user, group, and other.
<acl_specification>: A comma-separated list of ACL entries, for example user:mike:rw-.
<path>: The file or directory on which to set the ACL.
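The following are a few typical invocations of setfacl; the paths and the user and group names (john, kate, and analysts) are only examples:
$ hadoop fs -setfacl -m user:john:r-- /stock-data          # add or modify an entry for the user john
$ hadoop fs -setfacl -m group:analysts:rwx /data           # add an entry for a named group
$ hadoop fs -setfacl -x user:john /stock-data              # remove only the entry for the user john
$ hadoop fs -setfacl -b /stock-data                        # remove all entries except the base entries
$ hadoop fs -setfacl -R -m user:kate:r-x /data             # apply the change recursively to a directory tree
$ hadoop fs -setfacl --set user::rw-,user:john:rw-,group::r--,other::r-- /stock-data          # replace the ACL completely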
Now, let's see another command that is used to retrieve the ACLs:
hdfs dfs -getfacl [-R] <path>
This command can also be run using hadoop fs as follows:
hadoop fs -getfacl [-R] <path>
This command contains the following elements:
-R: List the ACLs of all files and directories recursively.
<path>: The file or directory whose ACLs are to be listed.
The getfacl command will list the base ACL entries as well as any additional ACL entries defined for the specified files or directories.
If ACLs are defined for a file or directory, then while accessing that file or directory, access is validated as follows:
1. If the user is the owner of the file, the owner permissions are used.
2. Otherwise, if the user matches one of the named user entries, the permissions of that entry, filtered by the mask, are used.
3. Otherwise, if the user is a member of the owning group or of any named group, access is granted if any of the matching group entries, filtered by the mask, grants the requested access.
4. Otherwise, the permissions of the other entry are used.
Let's assume that we have a file named stock-data containing stock market data. To retrieve all ACLs of this file, run the following command:
$ hadoop fs -getfacl /stock-data
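A typical output looks like the following; the owner hduser, the group supergroup, the file size, and the timestamp are only example values and will differ in your environment:
# file: /stock-data
# owner: hduser
# group: supergroup
user::rw-
group::r--
other::r--
The basic permissions of the same file can also be listed with the ls command:
$ hadoop fs -ls /stock-data
-rw-r--r--   1 hduser supergroup       1024 2016-01-10 11:20 /stock-data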
Because we have not defined any custom ACL for this file, the command returns only the base ACL entries, as shown in the sample output above.
You can also check the permissions of a file or directory using the ls command. In this example, the permission set of the stock-data file is -rw-r--r--, which means read and write access for the owner and read access for group members and others.
In the following command, we give read and write access to the user named mike:
$ hadoop fs -setfacl -m user:mike:rw- /stock-data
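We can then run getfacl again to verify the change. A typical output looks like the following; as before, the owner and group names and the file details are only example values:
$ hadoop fs -getfacl /stock-data
# file: /stock-data
# owner: hduser
# group: supergroup
user::rw-
user:mike:rw-
group::r--
mask::rw-
other::r--
$ hadoop fs -ls /stock-data
-rw-rw-r--+  1 hduser supergroup       1024 2016-01-10 11:20 /stock-data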
First, we defined the ACL for the user mike using the setfacl command; then, we retrieved the ACLs using the getfacl command.
The output of the getfacl command lists the base permissions as well as all ACL entries. Because we defined an ACL for the user mike, the output contains an extra row, user:mike:rw-.
There is also an extra row in the output, mask::rw-, which is the special mask ACL entry. The mask is a special ACL entry that limits the effective permissions of all named users, named groups, and the owning group. If you have not defined the mask ACL explicitly, its value is calculated as the union of all those permissions.
In addition, the output of the ls command also changes after ACLs are defined: an extra plus (+) sign at the end of the permissions string indicates that additional ACL entries are defined for this file or directory.
Now let's revoke the access of the user mike. To remove a specific ACL entry, the -x option is used with the setfacl command:
$ hadoop fs -setfacl -x user:mike /stock-data
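Running getfacl once more verifies that the change took effect; a typical output, again with example owner and group names, looks like this:
$ hadoop fs -getfacl /stock-data
# file: /stock-data
# owner: hduser
# group: supergroup
user::rw-
group::r--
mask::r--
other::r--
Note that the mask entry is recalculated and remains until all extended entries are removed, which can be done with the -b option.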
As the sample output above shows, after revoking the access of the user mike, the ACLs are updated and there is no longer an entry for the user mike.
You can read more about the permission model in Hadoop at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html.
Hive authorization is about verifying that a user is authorized to perform a particular action. Authentication, in contrast, is about verifying the identity of a user, which is a different concept from authorization.
Hive can be used in the following different ways:
- Through the Hive CLI, which accesses the metastore and the data in HDFS directly
- Through JDBC/ODBC clients such as Beeline, which connect via HiveServer2
- Through the metastore service and the HCatalog API (for example, from Pig or MapReduce jobs)
The appropriate authorization mechanism depends on how Hive is accessed.
The following are the various ways of performing authorization in Hive:
- Default (legacy) authorization: the mode available in older versions of Hive; it is not secure because it does not prevent malicious users from bypassing the checks.
- Storage-based authorization: authorization is delegated to the underlying storage layer (HDFS permissions and ACLs) and is enforced in the metastore server.
- SQL standards-based authorization: fine-grained authorization based on the SQL standard GRANT and REVOKE statements, enforced in HiveServer2.
To enable storage-based authorization in the metastore server, set the following properties:
| Property | Value |
| --- | --- |
| hive.metastore.pre.event.listeners | org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener |
| hive.security.metastore.authorization.manager | org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider |
| hive.security.metastore.authenticator.manager | org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator |
| hive.security.metastore.authorization.auth.reads | true |
After setting all these configurations, the Hive configuration file hive-site.xml will look as follows:
<configuration>
  <property>
    <name>hive.metastore.pre.event.listeners</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener</value>
  </property>
  <property>
    <name>hive.security.metastore.authorization.manager</name>
    <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
  </property>
  <property>
    <name>hive.security.metastore.authenticator.manager</name>
    <value>org.apache.hadoop.hive.ql.security.HadoopDefaultMetastoreAuthenticator</value>
  </property>
  <property>
    <name>hive.security.metastore.authorization.auth.reads</name>
    <value>true</value>
  </property>
</configuration>
Many more fine-grained permissions can be managed with SQL standards-based authorization. Refer to the SQL standards-based authorization documentation for more details.
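As a brief illustration, once SQL standards-based authorization is enabled in HiveServer2, privileges are managed with standard GRANT and REVOKE statements. The connection URL and the role, user, and table names below are only examples:
$ beeline -u jdbc:hive2://localhost:10000 -n admin
CREATE ROLE analyst;
GRANT SELECT ON TABLE stock_data TO ROLE analyst;
GRANT ROLE analyst TO USER mike;
SHOW GRANT ROLE analyst ON TABLE stock_data;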
In this article, we covered two recipes: Securing Hadoop and Authorizing Hive. You also learned the terminology of access permissions and their types, went through the steps to secure Hadoop, and learned the different ways to perform authorization in Hive.