# questions
Fernando Cabeza:
Hello, I am Fernando and I have been using Kedro for a few months now; it is an amazing tool for data scientists. However, I have run into a problem when using a callable (lambda function) as `usecols` in the catalog.yml file. I don't know the right way to do something like this in Kedro:
```python
import pandas as pd

check_cols = ['a', 'b', 'c', 'd']

# load only the columns listed in check_cols; columns missing from a
# given file are simply skipped instead of raising an error
df = pd.read_csv(
    path,
    sep=";",
    usecols=lambda x: x in check_cols,
)
```
The reason I need this is that I have a lot of similar CSV files with the same columns (using Kedro dataset factories), but some columns are missing in some files. Going back to the example: imagine one of the CSVs was missing column d; I would still want to load columns a, b and c. Could you help me with this problem using the YAML API?
datajoely:
[shared a link]
Yury Fedotov:
@Fernando Cabeza thanks for posting this, I was literally looking for a solution to this exact problem, and for the same reason 👋 Thanks @datajoely for this link, very helpful
Fernando Cabeza:
Thank you very much @datajoely for your time and quick response. It has been a great help to me. I'm glad @Yury Fedotov found this thread helpful as well.
datajoely:
This is one of my favourite features the team has built in years; I hit this wall so many times before we switched to OmegaConf.
Fernando Cabeza:
Yes, it is very useful for managing many similar CSV files. Thanks for working on it.
Vishal Pandey:
@datajoely Can you explain in layman's terms how an OmegaConf resolver can be helpful when developing projects in Kedro? Example scenarios would be very helpful.
Fernando Cabeza:
Hello @Vishal Pandey, in relation to the example above, here is catalog.yml:
```yaml
df_{year}:
  type: pandas.CSVDataset
  filepath: path_with_{year}_dependence.csv
  load_args:
    usecols: "${usecols_callable:}"
```
and settings.py:
```python
check_cols = ['a', 'b', 'c', 'd']

CONFIG_LOADER_ARGS = {
    # ... other CONFIG_LOADER_ARGS entries ...
    "custom_resolvers": {
        # the resolver returns the callable that ends up in load_args.usecols
        "usecols_callable": lambda: lambda x: x in check_cols,
    },
}
```
That's correct; however, if you want to use ParallelRunner you will run into `AttributeError: Can't pickle local object '<lambda>.<locals>.<lambda>'`.
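For readers wondering what `${usecols_callable:}` actually does: Kedro's `OmegaConfigLoader` registers each entry of `custom_resolvers` with OmegaConf, and a resolver is called when the configuration is resolved, so its return value (here, a callable) ends up in `load_args`. Below is a minimal sketch with plain OmegaConf, outside Kedro, to illustrate the mechanism; the config keys are illustrative, not from a real project.

```python
# Minimal sketch with plain OmegaConf (the library behind Kedro's
# OmegaConfigLoader) showing how "${usecols_callable:}" gets resolved.
from omegaconf import OmegaConf

check_cols = ["a", "b", "c", "d"]

# Kedro registers each entry of CONFIG_LOADER_ARGS["custom_resolvers"]
# in roughly this way.
OmegaConf.register_new_resolver(
    "usecols_callable", lambda: lambda x: x in check_cols
)

conf = OmegaConf.create({"load_args": {"usecols": "${usecols_callable:}"}})

usecols = conf.load_args.usecols   # the interpolation resolves to the inner lambda
print(usecols("a"), usecols("z"))  # expected: True False
```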
Vishal Pandey:
@Fernando Cabeza why do we get this error with ParallelRunner?
datajoely:
The short answer is that parallelism in Python is painful: ParallelRunner has to pickle objects to ship them to its worker processes, and some objects, such as locally defined lambdas, simply can't be pickled.
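One way around the pickling error is to move the logic into module-level functions in settings.py. This is a sketch, not from the thread: the names `keep_known_columns` and `usecols_callable`'s function form are made up, and it assumes the project's settings module is importable in the worker processes (it normally is in a Kedro project). Module-level functions are pickled by reference rather than by value, so the resolved `usecols` callable can survive the round trip to ParallelRunner's workers.

```python
# settings.py -- a ParallelRunner-friendly variant of the resolver above.
# Function names are illustrative; only module-level callables are used,
# because they are pickled by reference instead of by value.
from typing import Callable

check_cols = ["a", "b", "c", "d"]


def keep_known_columns(col: str) -> bool:
    """Return True for the columns pandas should load; missing ones are skipped."""
    return col in check_cols


def usecols_callable() -> Callable[[str], bool]:
    """Resolver body: returns the callable that ends up in load_args.usecols."""
    return keep_known_columns


CONFIG_LOADER_ARGS = {
    # ... other CONFIG_LOADER_ARGS entries ...
    "custom_resolvers": {
        "usecols_callable": usecols_callable,
    },
}
```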
Vishal Pandey:
@datajoely Does that mean we can fill in dynamic values anywhere in catalog.yml, for example in SQL queries like `select * from table_name where table_name.data == ${begin_date}`, and define `begin_date` in `custom_resolvers`?
datajoely:
You can actually do that without a custom resolver; you can use globals.
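For completeness, here is a sketch of the globals approach, assuming the `OmegaConfigLoader` (the default in recent Kedro versions), which ships a built-in `globals` resolver. The dataset name, file layout, date value and credentials key are illustrative, and the query is adapted from the question above: values defined in conf/base/globals.yml can be interpolated anywhere in the catalog with `${globals:key}`.

```yaml
# conf/base/globals.yml (illustrative)
begin_date: "2023-01-01"
```

```yaml
# conf/base/catalog.yml -- the built-in "globals" resolver looks the key up in globals.yml
orders_table:
  type: pandas.SQLQueryDataset
  sql: "select * from table_name where table_name.data = '${globals:begin_date}'"
  credentials: db_credentials
```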