
URL: https://www.analyticsvidhya.com/blog/2020/10/getting-started-with-apache-hive/

Getting Started with Apache Hive – A Must Know Tool For all Big Data and Data Engineering Professionals

Lakshay arora Last Updated : 14 Dec, 2020
6 min read

Overview

  • Understand the Apache Hive architecture and how it works.
  • Learn to perform some basic operations in Apache Hive.

Introduction

Most data scientists use SQL queries to explore data and draw valuable insights from it. As the volume of data keeps growing at a high pace, we need dedicated tools to deal with it.

Hadoop came first and became one of the most popular tools to store and process big data. But developers had to write complex map-reduce code to work with Hadoop. This is where Apache Hive, originally developed at Facebook, came to the rescue. It is another tool designed to work with Hadoop: we write SQL-like queries in Hive, and in the backend it converts them into map-reduce jobs.

πŸ‘ Image

In this article, we will look at the architecture of Hive and how it works. We will also learn how to do simple operations like creating a database and a table, loading data, and modifying a table.

Table of Contents

  1. What is Apache Hive?
  2. Apache Hive Architecture
  3. Working of Apache Hive
  4. Data Types in Apache Hive
  5. Create and Drop Database
  6. Create Table
  7. Load Data into Table
  8. Alter Table
  9. Advantages/Disadvantages of Apache Hive

What is Apache Hive?

πŸ‘ apache hive

Apache Hive is a data warehouse system, originally developed at Facebook, for processing huge amounts of structured data in Hadoop. To process data with Hadoop directly, we need to write complex map-reduce functions, which is not an easy task for most developers. Hive makes this work much easier.

It uses a query language called HiveQL, which is very similar to SQL. So we just have to write SQL-like commands, and in the backend Hive automatically converts them into map-reduce jobs.
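
As a small illustration, a HiveQL query reads almost exactly like its SQL counterpart. The table `page_views` and its columns are hypothetical names chosen for this sketch; they do not come from a real dataset.

```sql
-- Hypothetical table: page_views(user_id INT, url STRING, view_date DATE)
-- Count the views per URL and list the ten most viewed pages.
SELECT url, COUNT(*) AS total_views
FROM page_views
GROUP BY url
ORDER BY total_views DESC
LIMIT 10;
```

Hive takes care of turning the grouping and sorting in such a query into jobs that run on the cluster, so no map-reduce code has to be written by hand.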

Apache Hive Architecture

Let’s have a look at the following diagram which shows the architecture.

πŸ‘ hive architecture

  • Hive Clients: Hive lets us write applications using different types of clients, such as the Thrift server and the JDBC driver for Java, and it also supports applications that use the ODBC protocol.
  • Hive Services: As developers, if we wish to process any data, we need to use the Hive services, such as the Hive CLI (Command Line Interface). In addition, Hive provides a web-based interface to run Hive applications.
  • Hive Driver: It can receive queries from multiple sources, such as Thrift, JDBC, and ODBC clients through the Hive server, as well as directly from the Hive CLI and the web-based UI. After receiving the queries, it passes them to the compiler.
  • HiveQL Engine: It takes the query once it has been compiled and converts the SQL-like query into map-reduce jobs.
  • Meta Store: Here Hive stores the meta-information about the databases, such as the schema of each table, the data types of the columns, the location in HDFS, etc.
  • HDFS: It is simply the Hadoop Distributed File System, used to store the data. To learn more about HDFS, I would highly recommend going through this article: Introduction to the Hadoop Ecosystem

Working of Apache Hive

Now, let’s have a look at how Hive works on top of the Hadoop framework.

πŸ‘ apache hive vs hadoop

  1. First, we submit a query through the web interface or the command-line interface of Hive, which sends it to the driver for execution.
  2. The driver passes the query to the compiler, which verifies its syntax.
  3. Once the syntax has been verified, the compiler requests metadata from the meta store.
  4. The meta store responds to the compiler with the metadata it needs, such as the database, the tables, and the data types of the columns.
  5. The compiler checks the query against this metadata and sends the execution plan to the driver.
  6. The driver sends the execution plan to the HiveQL process engine, which converts the query into a map-reduce job.
  7. The engine then sends the task information to Hadoop, where the processing of the query begins; at the same time, it updates the metadata about the map-reduce job in the meta store.
  8. Once the processing is done, the execution engine receives the results of the query.
  9. The execution engine transfers the results back to the driver, which finally sends them to the Hive user interface, where we can see them.
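
One way to peek at the steps above is the EXPLAIN statement, which asks Hive to print the execution plan produced by the compiler instead of running the query. This is a minimal sketch that assumes a hypothetical `page_views` table with a `url` column. The output depends on the Hive version and the configured execution engine, so it is not reproduced here.

```sql
-- Hypothetical table: page_views(user_id INT, url STRING)
-- Print the plan (stages and operators) without executing the query.
EXPLAIN
SELECT url, COUNT(*) AS total_views
FROM page_views
GROUP BY url;
```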

Data Types in Apache Hive

Hive data types are divided into the following 5 categories:

  1. Numeric Types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL
  2. Date/Time Types: TIMESTAMP, DATE, INTERVAL
  3. String Types: STRING, VARCHAR, CHAR
  4. Complex Types: STRUCT, MAP, UNIONTYPE, ARRAY
  5. Misc Types: BOOLEAN, BINARY

Here is a small description of a few of them.

πŸ‘ Apache hive - data types

Create and Drop Database

Creating and dropping a database is very simple and similar to SQL. Each database in Hive needs a unique name. If a database with that name already exists, the statement fails with an error saying so; to avoid this, add the keywords IF NOT EXISTS after the DATABASE keyword.

CREATE DATABASE <<database_name>> ;

Dropping a database is also very simple: you just need to write DROP DATABASE followed by the name of the database. If you try to drop a database that doesn’t exist, Hive raises a SemanticException error.

DROP DATABASE <<database_name>> ;
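
Filled in with a concrete name, the two statements might look like the sketch below. The database name `sales_db` is hypothetical. SHOW DATABASES, USE, IF EXISTS, and CASCADE are standard HiveQL additions that the templates above do not show.

```sql
-- 'sales_db' is a hypothetical database name.
-- IF NOT EXISTS makes the statement safe to run more than once.
CREATE DATABASE IF NOT EXISTS sales_db;

-- List the databases Hive knows about, then switch to the new one.
SHOW DATABASES;
USE sales_db;

-- IF EXISTS avoids the SemanticException for a missing database.
-- By default Hive refuses to drop a database that still contains tables;
-- CASCADE drops those tables as well, so use it with care.
DROP DATABASE IF EXISTS sales_db CASCADE;
```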

Create Table

We use the create table statement to create a table and the complete syntax is as follows.

CREATE TABLE IF NOT EXISTS <<database_name.>><<table_name>>
 (column_name_1 data_type_1,
 column_name_2 data_type_2,
 .
 .
 column_name_n data_type_n)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\t'
 LINES TERMINATED BY '\n'
 STORED AS TEXTFILE;

If you are already using the database, you do not need to write database_name.table_name; the table name alone is enough. With big data, we usually import data from external files, so the statement lets us declare up front the field delimiter used in the file, the line terminator, and the format in which the table is stored.
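
As a concrete sketch of the template above, the statement below creates a tab-delimited text table. The database `sales_db` (assumed to exist already), the table `orders`, and its columns are hypothetical names picked for this example.

```sql
-- Hypothetical database, table and column names.
-- Fields are separated by tabs, each record sits on its own line,
-- and the data is stored as plain text files.
CREATE TABLE IF NOT EXISTS sales_db.orders (
  order_id    INT,
  customer    STRING,
  quantity    INT,
  order_date  DATE
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Inspect the columns and their data types.
DESCRIBE sales_db.orders;
```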

There are 2 types of Hive tables: internal and external. Please go through this article to know more about the concept: Types of Tables in Apache Hive: A Quick Overview

Load Data into Table

Now that the tables have been created, it’s time to load data into them. We can load data from any local file on our system using the following syntax.

LOAD DATA LOCAL INPATH '<<path of file on your local system>>'
 INTO TABLE
 <<database_name.>><<table_name>> ;

When we work with a huge amount of data, some rows may contain values that do not match the data type of their column. In that case, Hive does not throw an error; it returns NULL in place of those values when the table is queried. This is a very useful feature, as loading big data files into Hive is an expensive process and we do not want the whole load to fail just because of a few bad rows.
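
The sketch below puts the load statement and the NULL behaviour together. The table `orders`, its columns, and the file path `/tmp/orders.tsv` are hypothetical; the file is assumed to be tab-separated with one record per line.

```sql
-- Hypothetical table and file path, used only for illustration.
CREATE TABLE IF NOT EXISTS orders (
  order_id  INT,
  customer  STRING,
  quantity  INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- The path is written as a quoted string. LOCAL tells Hive to read
-- the file from the local file system rather than from HDFS.
LOAD DATA LOCAL INPATH '/tmp/orders.tsv'
INTO TABLE orders;

-- A row whose quantity field holds text such as 'abc' is still loaded.
-- When queried, that value comes back as NULL, so the bad rows
-- can be found like this:
SELECT order_id, customer, quantity
FROM orders
WHERE quantity IS NULL;
```

Writing OVERWRITE INTO TABLE instead of INTO TABLE replaces the data already in the table rather than appending to it.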

Alter Table

In Hive, we can make several modifications to existing tables, such as renaming a table or adding more columns to it. The commands to alter a table are very similar to the SQL commands.

Here is the syntax to rename the table:

ALTER TABLE <<table_name>> RENAME TO <<new_name>> ;

Syntax to add more columns to the table:

-- to add more columns
ALTER TABLE <<table_name>> ADD COLUMNS
 (new_column_name_1 data_type_1,
 new_column_name_2 data_type_2,
 .
 .
 new_column_name_n data_type_n) ;
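
With concrete names filled in, the two ALTER TABLE forms could look like this. The table `orders` is assumed to exist already, and every name below is hypothetical.

```sql
-- Hypothetical table and column names, used only for illustration.
ALTER TABLE orders RENAME TO customer_orders;

ALTER TABLE customer_orders ADD COLUMNS (
  discount_code  STRING,
  shipped        BOOLEAN
);

-- Confirm the new name and the added columns.
DESCRIBE customer_orders;
```

The new columns are appended after the existing ones, and rows that were loaded before the change return NULL for them.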

Advantages/Disadvantages of Apache Hive

Advantages:

  • It uses a SQL-like query language that is already familiar to most developers, which makes it easy to use.
  • It is highly scalable, so you can use it to process very large datasets.
  • It supports multiple databases, such as MySQL, Derby, PostgreSQL, and Oracle, for its metastore.
  • It supports multiple data formats and also allows indexing, partitioning, and bucketing for query optimization.

Disadvantages:

  • It is built for batch workloads on cold data and is not suited to processing real-time data.
  • It is comparatively slower than some of its competitors. If your use case is mostly about batch processing, Hive works well.

End Notes

In this article, we have seen the architecture of Apache Hive, how it works, and some of the basic operations to get started with. In the next article of this series, we will look at the more complex and important concepts of partitioning and bucketing in Hive.

If you have any questions related to this article, do let me know in the comments section below.

Ideas have always excited me. The fact that we could dream of something and bring it to reality fascinates me. Computer Science provides me a window to do exactly that. I love programming and use it to solve problems, and I am a beginner in the field of Data Science.
