# questions
Fernando Cabeza:
Hello, I am Fernando and I have been using Kedro for a few months now; it is an amazing tool for data scientists. However, I have run into a problem when using a callable (lambda function) as `usecols` in the catalog.yml file. I don't know the right way to do something like this in Kedro:
```python
import pandas as pd

check_cols = ['a', 'b', 'c', 'd']

# load only the columns listed in check_cols; columns missing from a
# given file are simply skipped instead of raising an error
df = pd.read_csv(
    path,
    sep=";",
    usecols=lambda x: x in check_cols,
)
```
The reason I need this is that I have a lot of similar CSV files with the same columns (using Kedro dataset factories), but some columns are missing in some files. Going back to the example: imagine one of the CSVs was missing column d; I would still want to load columns a, b and c. Could you help me with this problem using the YAML API?
datajoely:
[shared a link]
Yury Fedotov:
@Fernando Cabeza thanks for posting this, I was literally looking for a solution to this exact problem, and for the same reason 👋 Thanks @datajoely for this link, very helpful
Fernando Cabeza:
Thank you very much @datajoely for your time and quick response. It has been a great help to me. I'm glad @Yury Fedotov found this thread helpful as well.
datajoely:
This is one of my favourite features the team has built in years; I hit this wall so many times before we switched to OmegaConf.
Fernando Cabeza:
Yes, it is very useful for managing many similar CSV files. Thanks for working on it.
Vishal Pandey:
@datajoely Can you explain in layman's terms how an OmegaConf resolver can be helpful when developing projects in Kedro? Example scenarios would be very helpful.
Fernando Cabeza:
Hello @Vishal Pandey, in relation to the example above, here is catalog.yml:
```yaml
df_{year}:
  type: pandas.CSVDataset
  filepath: path_with_{year}_dependence.csv
  load_args:
    usecols: "${usecols_callable:}"
```
and settings.py:
```python
check_cols = ['a', 'b', 'c', 'd']

CONFIG_LOADER_ARGS = {
    # ... other CONFIG_LOADER_ARGS entries ...
    "custom_resolvers": {
        # the resolver returns the callable that ends up in load_args.usecols
        "usecols_callable": lambda: lambda x: x in check_cols,
    },
}
```
That's correct; however, if you want to use ParallelRunner you will run into `AttributeError: Can't pickle local object '<lambda>.<locals>.<lambda>'`.
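For readers wondering what `${usecols_callable:}` actually does: Kedro's `OmegaConfigLoader` registers each entry of `custom_resolvers` with OmegaConf, and a resolver is called when the configuration is resolved, so its return value (here, a callable) ends up in `load_args`. Below is a minimal sketch with plain OmegaConf, outside Kedro, to illustrate the mechanism; the config keys are illustrative, not from a real project.

```python
# Minimal sketch with plain OmegaConf (the library behind Kedro's
# OmegaConfigLoader) showing how "${usecols_callable:}" gets resolved.
from omegaconf import OmegaConf

check_cols = ["a", "b", "c", "d"]

# Kedro registers each entry of CONFIG_LOADER_ARGS["custom_resolvers"]
# in roughly this way.
OmegaConf.register_new_resolver(
    "usecols_callable", lambda: lambda x: x in check_cols
)

conf = OmegaConf.create({"load_args": {"usecols": "${usecols_callable:}"}})

usecols = conf.load_args.usecols   # the interpolation resolves to the inner lambda
print(usecols("a"), usecols("z"))  # expected: True False
```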
Vishal Pandey:
@Fernando Cabeza why do we get this error with ParallelRunner?
datajoely:
The short answer is that parallelism in Python is painful: ParallelRunner has to pickle objects to ship them to its worker processes, and some objects, such as locally defined lambdas, simply can't be pickled.
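One way around the pickling error is to move the logic into module-level functions in settings.py. This is a sketch, not from the thread: the names `keep_known_columns` and `usecols_callable`'s function form are made up, and it assumes the project's settings module is importable in the worker processes (it normally is in a Kedro project). Module-level functions are pickled by reference rather than by value, so the resolved `usecols` callable can survive the round trip to ParallelRunner's workers.

```python
# settings.py -- a ParallelRunner-friendly variant of the resolver above.
# Function names are illustrative; only module-level callables are used,
# because they are pickled by reference instead of by value.
from typing import Callable

check_cols = ["a", "b", "c", "d"]


def keep_known_columns(col: str) -> bool:
    """Return True for the columns pandas should load; missing ones are skipped."""
    return col in check_cols


def usecols_callable() -> Callable[[str], bool]:
    """Resolver body: returns the callable that ends up in load_args.usecols."""
    return keep_known_columns


CONFIG_LOADER_ARGS = {
    # ... other CONFIG_LOADER_ARGS entries ...
    "custom_resolvers": {
        "usecols_callable": usecols_callable,
    },
}
```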
Vishal Pandey:
@datajoely Does that mean we can fill in dynamic values anywhere in catalog.yml, for example in SQL queries like `select * from table_name where table_name.data == ${begin_date}`, and define `begin_date` in `custom_resolvers`?
datajoely:
You can actually do that without a custom resolver; you can use globals.
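For completeness, here is a sketch of the globals approach, assuming the `OmegaConfigLoader` (the default in recent Kedro versions), which ships a built-in `globals` resolver. The dataset name, file layout, date value and credentials key are illustrative, and the query is adapted from the question above: values defined in conf/base/globals.yml can be interpolated anywhere in the catalog with `${globals:key}`.

```yaml
# conf/base/globals.yml (illustrative)
begin_date: "2023-01-01"
```

```yaml
# conf/base/catalog.yml -- the built-in "globals" resolver looks the key up in globals.yml
orders_table:
  type: pandas.SQLQueryDataset
  sql: "select * from table_name where table_name.data = '${globals:begin_date}'"
  credentials: db_credentials
```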