Benjamin Cheung
06/30/2024, 6:55 PM
Yury Fedotov
07/01/2024, 1:51 AM
PartitionedDataset (docs).
In short, how it works is that you define it like this:
my_partitioned_dataset:
  type: partitions.PartitionedDataset
  path: s3://my-bucket-name/path/to/folder  # path to the location of partitions
  dataset: pandas.CSVDataset  # shorthand notation for the dataset which will handle individual partitions
And that means:
1. Go to the folder specified in path
2. Read all items as individual datasets (in this case pandas.CSVDataset)
3. On load, return a dict[str, Callable] where each str is a partition's filename and each value is a load function; calling it gives you whatever your dataset would read, which in the example above is a pd.DataFrame
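To make the three steps concrete, here is a rough sketch of the same idea in plain Python. It is not Kedro's code: load_partitions and demo are hypothetical names, it reads a local temp folder instead of S3, and it uses the standard csv module instead of pandas so it runs without extra installs. The values are load functions rather than already-loaded objects, which, as far as I know, is also how PartitionedDataset hands partitions back.

```python
import csv
import tempfile
from pathlib import Path
from typing import Callable

Rows = list[dict[str, str]]


def load_partitions(folder: str, suffix: str = ".csv") -> dict[str, Callable[[], Rows]]:
    """Hypothetical stand-in: map each file name in `folder` to a lazy loader."""

    def make_loader(path: Path) -> Callable[[], Rows]:
        def load() -> Rows:
            # The file is only read when the loader is called.
            with path.open(newline="") as f:
                return list(csv.DictReader(f))

        return load

    # Steps 1 and 2: go to the folder and treat every matching file as one partition.
    return {p.name: make_loader(p) for p in sorted(Path(folder).glob(f"*{suffix}"))}


def demo() -> dict[str, Rows]:
    """Write two small partitions to a temp folder, then load them all."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "jan.csv").write_text("id,amount\n1,10\n")
        Path(tmp, "feb.csv").write_text("id,amount\n2,20\n3,30\n")
        partitions = load_partitions(tmp)
        # Step 3: a dict keyed by file name; call each value to get the data.
        return {name: load() for name, load in partitions.items()}
```

In a node you would typically loop over the dict the same way demo does, calling each loader and combining or processing the results one partition at a time.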
Richard Purvis
07/01/2024, 8:17 PM