pyspark.sql.DataFrameWriter.bucketBy
- DataFrameWriter.bucketBy(numBuckets, col, *cols)
- Buckets the output by the given columns. If specified, the output is laid out on the file system similar to Hive’s bucketing scheme, but with a different bucket hash function and is not compatible with Hive’s bucketing.

- New in version 2.3.0.

- Changed in version 3.4.0: Supports Spark Connect.

- Parameters
- numBuckets : int
- the number of buckets to save.
- col : str, list or tuple
- a name of a column, or a list of names.
- cols : str
- additional names (optional). If col is a list it should be empty.
 
- Notes

- Applicable for file-based data sources in combination with DataFrameWriter.saveAsTable().

- Examples

- Write a DataFrame into a Parquet file in a bucketed manner, and read it back.

>>> # Write a DataFrame into a Parquet file in a bucketed manner.
... _ = spark.sql("DROP TABLE IF EXISTS bucketed_table")
>>> spark.createDataFrame([
...     (100, "Hyukjin Kwon"), (120, "Hyukjin Kwon"), (140, "Haejoon Lee")],
...     schema=["age", "name"]
... ).write.bucketBy(2, "name").mode("overwrite").saveAsTable("bucketed_table")
>>> # Read the Parquet file as a DataFrame.
... spark.read.table("bucketed_table").sort("age").show()
+---+------------+
|age|        name|
+---+------------+
|100|Hyukjin Kwon|
|120|Hyukjin Kwon|
|140| Haejoon Lee|
+---+------------+
>>> _ = spark.sql("DROP TABLE bucketed_table")
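
- A minimal sketch of bucketing by more than one column, which is what the optional *cols parameter allows, combined with DataFrameWriter.sortBy(). It assumes the same SparkSession named spark as above; the table name bucketed_sorted_table is purely illustrative.

>>> # Illustrative: bucket by two columns and sort rows within each bucket.
... _ = spark.sql("DROP TABLE IF EXISTS bucketed_sorted_table")
>>> spark.createDataFrame([
...     (100, "Hyukjin Kwon"), (120, "Hyukjin Kwon"), (140, "Haejoon Lee")],
...     schema=["age", "name"]
... ).write.bucketBy(2, "name", "age").sortBy("age").mode(
...     "overwrite").saveAsTable("bucketed_sorted_table")
>>> spark.read.table("bucketed_sorted_table").sort("age").show()
+---+------------+
|age|        name|
+---+------------+
|100|Hyukjin Kwon|
|120|Hyukjin Kwon|
|140| Haejoon Lee|
+---+------------+
>>> _ = spark.sql("DROP TABLE bucketed_sorted_table")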