I have a pandas CSVDataSet and I am trying out a coding styl Kedro #questions

Galen Seilis

08/24/2023, 2:13 PM

I have a pandas.CSVDataSet and I am trying out a coding style with Kedro where I avoid wrapping boilerplate functions around common pandas functions. This is done with lambda functions. Here is an example where I drop duplicate rows.

node(
    func=lambda data:data.drop_duplicates(),
    inputs='some_data_set',
    outputs='dup_dropped_data',
    name='drop_duplicates'
)

But there are plenty of commands in Pandas that act on only a single column, like replace.

node(
    func=lambda data:data.some_column.replace({'meow':'woof'}),
    inputs='some_data_set',
    outputs='demeowed_data',
    name='remove_meowing'
)

The latter example will only return the pandas Series for some_column. Is there a way to change a single column but return the entire dataframe?

datajoely

08/24/2023, 2:18 PM

We don’t have a literal(). It’s been asked before, but not enough people demanded it, so it looks like the issue was closed: https://github.com/kedro-org/kedro/issues/526

👍 1

Galen Seilis

08/24/2023, 2:23 PM

@datajoely Thank you for sending that link and summarizing the current state.

datajoely

08/24/2023, 2:30 PM

you can also use functools to partially apply a literal https://waylonwalker.com/kedro-node/#using-a-partial-function

👍 1
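A minimal sketch of the partial-application idea, for readers who want to see it in pandas terms. The helper name replace_in_column and the column names are hypothetical, chosen only for illustration; functools.partial binds the literal arguments so that the resulting callable takes just the dataframe.

```python
from functools import partial

import pandas as pd


def replace_in_column(data: pd.DataFrame, column: str, mapping: dict) -> pd.DataFrame:
    """Hypothetical helper: replace values in one column, return the whole frame."""
    out = data.copy()  # leave the caller's dataframe untouched
    out[column] = out[column].replace(mapping)
    return out


# Bind the literals; the result is a one-argument callable.
remove_meowing = partial(
    replace_in_column, column="some_column", mapping={"meow": "woof"}
)

df = pd.DataFrame({"some_column": ["meow", "purr"], "other_column": [1, 2]})
result = remove_meowing(df)
```

A callable like remove_meowing could then be passed as func to node(...) in place of a lambda, assuming the node has a single dataframe input.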

Lodewic van Twillert

08/24/2023, 3:05 PM

@Galen Seilis Correct me if I'm wrong, but I think this is more of a pandas question than a Kedro one. If so, at least I can add something about pandas :) In this particular example I believe you can also apply .replace to the whole DataFrame at once. You can pass dictionaries to specify which values should be replaced in which columns, so you can replace values in multiple columns at the same time.

node(
    func=lambda data: data.replace(to_replace={'some_column': 'meow'}, value={'some_column':'woof'}),
    inputs='some_data_set',
    outputs='demeowed_data',
    name='remove_meowing'
)
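As a quick check outside Kedro, the difference between the column-level and frame-level calls can be sketched as follows. The sample data is made up for illustration; the nested-dict form at the end is an alternative spelling that pandas also accepts, not something suggested in the thread.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "some_column": ["meow", "purr"],
        "other_column": ["meow", "hiss"],
    }
)

# Column-level replace: returns only a Series, the other columns are lost.
series_only = df.some_column.replace({"meow": "woof"})

# Frame-level replace with per-column dicts: returns the whole DataFrame
# and only touches 'some_column'.
whole_frame = df.replace(
    to_replace={"some_column": "meow"},
    value={"some_column": "woof"},
)

# Equivalent nested-dict form: {column: {old_value: new_value}}.
nested_form = df.replace({"some_column": {"meow": "woof"}})
```

Note that "meow" in other_column is left alone, because the dictionaries restrict the replacement to some_column.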

But let's assume you still want to use methods that only apply to a single column. You can always use pd.DataFrame.assign() to re-assign an existing column and return the dataframe anyway, using another lambda within the .assign():

    your_dataframe.assign(some_value=lambda d: do_something(d))

In this case d is the entire dataframe that you are applying .assign to. Use it like this in your node if you want:

node(
    func=lambda data: data.assign(your_column=lambda d: d.your_column.replace({'meow':'woof'})),
    inputs='some_data_set',
    outputs='demeowed_data',
    name='remove_meowing'
)

--edit: even easier might be this, dropping the lambda within .assign():

node(
    func=lambda data: data.assign(your_column=data.your_column.replace({'meow':'woof'})),
    inputs='some_data_set',
    outputs='demeowed_data',
    name='remove_meowing'
)

👍 1
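The two .assign() variants can be compared directly in plain pandas. This is only a sketch with made-up data; the chained example at the end is an added illustration of when the inner lambda still matters, since it receives the frame as it exists at that point in a method chain rather than the original data.

```python
import pandas as pd

df = pd.DataFrame({"your_column": ["meow", "meow", "purr"], "n": [1, 1, 2]})

# Variant 1: inner lambda, d is the frame that .assign() is called on.
with_lambda = df.assign(
    your_column=lambda d: d.your_column.replace({"meow": "woof"})
)

# Variant 2: no inner lambda, refers to the outer dataframe directly.
without_lambda = df.assign(
    your_column=df.your_column.replace({"meow": "woof"})
)

# In a method chain the inner lambda sees the intermediate result,
# here the frame after duplicates have been dropped.
chained = df.drop_duplicates().assign(
    your_column=lambda d: d.your_column.replace({"meow": "woof"})
)
```

Both variants return a new DataFrame and leave df unchanged, which suits a node function that should not mutate its input.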

Galen Seilis

08/25/2023, 3:08 AM

@Lodewic van Twillert Thank you, the approach you've illustrated satisfies the constraint I am interested in! I had not noticed the value parameter of pandas.DataFrame.replace. I appreciate you pointing out that using it allows the entire dataframe to be returned.

👍 1

Iñigo Hidalgo

08/28/2023, 12:00 PM

Kinda late here but I tend to use the df.assign functionality for what you're proposing, like @Lodewic van Twillert’s final edited suggestion.
