pyspark.sql.datasource.DataSourceReader.read
abstract DataSourceReader.read(partition)
Generates data for a given partition and returns an iterator of tuples or rows.

This method is invoked once per partition to read the data. Implementing this method is required for readable data sources. You can initialize any non-serializable resources required for reading data from the data source within this method.

Parameters
partition : object
    The partition to read. It must be one of the partition values returned by DataSourceReader.partitions().
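To make the contract concrete, here is a minimal sketch of how partitions() produces the values that Spark later passes back into read(). The RangeReader class, its chunking scheme, and all values are illustrative assumptions, not part of the API:

>>> from pyspark.sql.datasource import DataSourceReader, InputPartition
>>> class RangeReader(DataSourceReader):
...     def partitions(self):
...         # Each InputPartition value here is later passed, one at a
...         # time, to read() on the executors.
...         return [InputPartition(i) for i in range(3)]
...     def read(self, partition: InputPartition):
...         # `partition` is exactly one of the values returned above.
...         start = partition.value * 10
...         for i in range(start, start + 10):
...             yield (i,)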
 
Returns
iterator of tuples or PyArrow’s RecordBatch
    An iterator of tuples or rows. Each tuple or row will be converted to a row in the final DataFrame. It can also return an iterator of PyArrow’s RecordBatch if the data source supports it.
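For sources that already produce Arrow data, yielding RecordBatches avoids per-row conversion. A minimal sketch, assuming illustrative column names that must match the schema declared by the data source:

>>> import pyarrow as pa
>>> def read(self, partition: InputPartition):
...     # Yield a whole batch at a time; column names and types must
...     # match the data source's declared schema.
...     yield pa.RecordBatch.from_pydict(
...         {"partition": [partition.value] * 2, "value": [0, 1]}
...     )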
 
Examples

Yields a list of tuples:

>>> def read(self, partition: InputPartition):
...     yield (partition.value, 0)
...     yield (partition.value, 1)

Yields a list of rows:

>>> def read(self, partition: InputPartition):
...     yield Row(partition=partition.value, value=0)
...     yield Row(partition=partition.value, value=1)
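Initializes a non-serializable resource inside read(), per the note above; read() runs on the executors, so a handle opened on the driver could not be pickled and shipped to them. A sketch with an assumed file path and record layout:

>>> import json
>>> def read(self, partition: InputPartition):
...     # Open the handle here, on the executor, not in __init__ on the
...     # driver. The path and field names are illustrative assumptions.
...     with open("/tmp/data.jsonl") as f:
...         for line in f:
...             record = json.loads(line)
...             yield (record["id"], record["value"])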