# questions
d
Not exactly a support question, but for people who use/have considered using PartitionedDataSet... Let's say I have a catalog entry like:
my_pds:
  type: PartitionedDataSet
  path: data/01_raw/subjects
  dataset:
    type: my_project.io.MyCustomDataSet
And data like:
data/01_raw/subjects/C001/scans/0.png
data/01_raw/subjects/C001/scans/1.png
data/01_raw/subjects/C001/scans/2.png
data/01_raw/subjects/C001/test_results.csv
data/01_raw/subjects/C001/notes.png
data/01_raw/subjects/C002/scans/0.png
data/01_raw/subjects/C002/scans/1.png
data/01_raw/subjects/C002/scans/2.png
data/01_raw/subjects/C002/test_results.csv
data/01_raw/subjects/C002/notes.png
data/01_raw/subjects/T001/scans/0.png
data/01_raw/subjects/T001/scans/1.png
data/01_raw/subjects/T001/scans/2.png
data/01_raw/subjects/T001/test_results.csv
data/01_raw/subjects/T001/notes.png
What do you think the resulting partitions would be?
j
This is something I’ve questioned as well when I have data nested at different levels. I think in my case I ended up avoiding the problem by putting everything in the same working directory and placing the folder info into the file names themselves. That said, all my data was the same type.
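A minimal sketch of that flattening workaround, assuming a double-underscore separator (a hypothetical choice, as are the helper names `flatten_name` and `unflatten_name`; none of this is part of Kedro). It only works if the separator never appears in a real file or folder name:

```python
from pathlib import PurePosixPath

SEP = "__"  # hypothetical separator; must not occur in real file or folder names


def flatten_name(path, base, sep=SEP):
    """Encode the folder structure below `base` into a single file name."""
    rel = PurePosixPath(path).relative_to(base)
    return sep.join(rel.parts)


def unflatten_name(name, sep=SEP):
    """Recover the original relative path from a flattened file name."""
    return str(PurePosixPath(*name.split(sep)))


flat = flatten_name("data/01_raw/subjects/C001/scans/0.png", "data/01_raw/subjects")
print(flat)
print(unflatten_name(flat))
```

With every file sitting in one directory, each file is still its own partition, but the subject and subfolder can be read back out of the name.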
d
I was curious because I came across this while helping somebody yesterday, and the behavior was different from what I intuitively expected, despite having used PartitionedDataSet on several occasions in the past. What happens is that every file under the path (at any level) becomes a partition, a result of finding every file under there recursively, rather than using the top-level file or folder as the partition key. I kinda expected the latter, but that may just be me.
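To make the difference concrete, here is a rough pure-Python sketch that does not call Kedro at all. It only mimics the two keying schemes on the file listing from the question: one partition per file keyed by its path relative to the dataset path (the behavior described above), versus one group per top-level folder (what I expected). The function names are made up for illustration, and the exact partition ID format Kedro produces may differ in detail.

```python
from collections import defaultdict
from pathlib import PurePosixPath

BASE = "data/01_raw/subjects"
SUBJECTS = ["C001", "C002", "T001"]
PER_SUBJECT = ["scans/0.png", "scans/1.png", "scans/2.png", "test_results.csv", "notes.png"]

# Rebuild the file listing from the question.
FILES = [f"{BASE}/{subject}/{name}" for subject in SUBJECTS for name in PER_SUBJECT]


def recursive_partition_ids(paths, base):
    """One partition per file, keyed by the file's path relative to `base`."""
    return sorted(str(PurePosixPath(p).relative_to(base)) for p in paths)


def group_by_top_level(paths, base):
    """Alternative scheme: one key per top-level folder under `base`."""
    grouped = defaultdict(list)
    for p in paths:
        rel = PurePosixPath(p).relative_to(base)
        grouped[rel.parts[0]].append(str(PurePosixPath(*rel.parts[1:])))
    return dict(grouped)


per_file = recursive_partition_ids(FILES, BASE)
per_subject = group_by_top_level(FILES, BASE)
print(len(per_file), "partitions when every file is a partition")
print(len(per_subject), "partitions when keyed by top-level folder")
```

Note that in the per-file scheme, the same custom dataset class would be asked to load PNGs and CSVs alike, which is part of why the result surprised me.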