SerDe
SerDe stands for Serialization and Deserialization. It is the mechanism Hive uses to process records and map them to the column data types of a table. To see where SerDe fits, we first need to understand how Hive reads and writes data.
The process to read data is as follows:
- Data is read from HDFS.
- Data is processed by the INPUTFORMAT implementation, which defines the input data split and key/value records. In Hive, we can use CREATE TABLE ... STORED AS <FILE_FORMAT> (see Chapter 9, Performance Considerations) to specify which INPUTFORMAT it reads from.
- The Java Deserializer class defined in SerDe is called to format the data into a record that maps to column and data types in a table.
As an example of reading data, we can use a JSON SerDe to read TEXTFILE-format data from HDFS and translate the attributes and values in each JSON row into rows of a Hive table with the correct schema.
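The read path above can be pictured with a small toy model. The sketch below is plain Python, not Hive code: read_splits and deserialize are hypothetical stand-ins for the INPUTFORMAT and Deserializer steps, and SCHEMA is an assumed three-column table. It only illustrates the idea of turning raw JSON text into typed rows. Filling a missing attribute with None is a choice made for this sketch.

```python
import json

# Hypothetical table schema: column name -> Python type standing in for a Hive type.
SCHEMA = {"name": str, "age": int, "city": str}


def read_splits(raw_text):
    """Stand-in for the INPUTFORMAT step: one record per non-empty line."""
    return [line for line in raw_text.splitlines() if line.strip()]


def deserialize(record, schema):
    """Stand-in for the Deserializer step: parse one JSON record into a typed row."""
    obj = json.loads(record)
    row = []
    for column, col_type in schema.items():
        value = obj.get(column)
        # A missing attribute becomes None in this sketch; otherwise cast to the column type.
        row.append(col_type(value) if value is not None else None)
    return tuple(row)


# Raw text as it might sit in a file: one JSON object per line.
raw = '{"name": "Ann", "age": "31", "city": "Oslo"}\n{"name": "Bob", "age": 45}\n'

rows = [deserialize(record, SCHEMA) for record in read_splits(raw)]
print(rows)
```

In this model the schema drives the conversion: the JSON attribute names are matched to column names, and each value is cast to the column's declared type, which is the role the text assigns to the Deserializer.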
The process to write data is as follows:
- Data (such as using an INSERT statement) to be written...