InfoQ Homepage News Google Introduces Cloud Storage Connector for Hadoop Big Data Workloads
Google Introduces Cloud Storage Connector for Hadoop Big Data Workloads
This item in japanese
Sep 09, 2019 2 min read
by
Write for InfoQ
Feed your curiosity. Help 550k+ globalsenior developers
each month stay ahead.Get in touch
In a recent blog post, Google announced a new Cloud Storage connector for Hadoop. This new capability allows organizations to substitute their traditional Hadoop Distributed File System (HDFS) with Google Cloud Storage. Columnar file formats such as Parquet and ORC may realize increased throughput, and customers will also benefit from Cloud Storage directory isolation, lower latency, increased parallelization and intelligent defaults.
While HDFS is a popular storage solution for Hadoop customers, it can be operationally complex, for example when maintaining long-running HDFS clusters. Google Cloud Storage is a unified object store that exposes data through a unified API. It is a managed solution that supports both high-performance and archival use cases. The Cloud Storage connector is an open-source Java client library that implements Hadoop Compatible FileSystem (HCFS) and runs inside Hadoop JVMs, which allows big-data processes, like Hadoop or Spark jobs, access to underlying data from Google Cloud Storage.
Google believes that there are many opportunities when using Google Cloud Storage over HDFS, including:
- Significant cost reduction as compared to a long-running HDFS cluster with three replicas on persistent disks;
- Separation of storage from compute, allowing you to grow each layer independently;
- Persisting the storage even after Hadoop clusters are terminated;
- Sharing Cloud Storage buckets between ephemeral Hadoop clusters;
- No storage administration overhead, like managing upgrades and high availability for HDFS.
Even though the connector is open-source, it is supported by Google Cloud Platform and comes pre-configured in Cloud Dataproc, Google’s fully managed service for running Apache Hadoop and Apache Spark workloads. In addition, it can be installed and fully supported in other Hadoop distributions, including MapR, Cloudera and Hortonworks. This interoperability allows customers to transition their on-premises big data solutions to the cloud.
Using Cloud Storage in Hadoop implementations, offers customers performance improvements. One customer who has been able to take advantage of improved performance is Twitter. In Twitter’s implementation, they:
Started testing big data SQL queries against columnar files in Cloud Storage at massive scale, against a 20+ PB dataset. Since the Cloud Storage Connector is open source, Twitter prototyped the use of range requests to read only the columns required by the query engine, which increased read efficiency. We (Google) incorporated that work into a more generalized fadvise feature.
Another capability introduced as part of the Cloud Storage connector is cooperative locking, which isolates storage modification operations executed by the Hadoop file system shell. Igor Dvorzhak, a software engineer, explains the importance of this feature:
Although Cloud Storage is strongly consistent at the object level, it does not natively support directory semantics. For example, what should happen if two users issue conflicting commands (delete vs. rename) to the same directory? In HDFS, such directory operations are atomic and consistent.
To address cooperative locking, Google worked with Twitter to implement the feature in the Cloud Storage connector that prevents data inconsistencies during competing directory operations.
Existing Cloud Storage Connector customers can upgrade to the new version of Cloud Storage Connector using the connectors initialization action for existing Cloud Dataproc versions. As of Cloud Dataproc version 2.0, it will become the standard connector.
This content is in the Cloud topic
Related Topics:
-
Related Editorial
-
Related Sponsors
-
Popular across InfoQ
-
ArrowJS Reaches 1.0, Recast as the First UI Framework for the Agentic Era
-
Anthropic Releases and Temporarily Suspends Claude Fable 5
-
Slack Eliminates SSH in EMR Pipelines, Migrates 700+ Jobs to Rest-Based Architecture
-
Anthropic Explains How Claude Builds Its Own Execution Harnesses
-
Spring Boot 4.1 Adds gRPC Auto-Configuration, SSRF Mitigation, and Kotlin 2.3 Support
-
Increasing Users' Data Agency: From BlueSky's AT Protocol to the Local-First Software Movement
-
The InfoQ Newsletter
A round-up of last week’s content on InfoQ sent out every Tuesday. Join a community of over 250,000 senior developers. View an example
