# questions
g
I have a pandas.CSVDataSet and I am trying out a coding style with Kedro where I avoid wrapping boilerplate functions around common pandas functions. This is done with lambda functions. Here is an example where I drop duplicate rows.
node(
    func=lambda data:data.drop_duplicates(),
    inputs='some_data_set',
    outputs='dup_dropped_data',
    name='drop_duplicates'
)
But there are plenty of commands in Pandas that act on only a single column, like replace.
node(
    func=lambda data:data.some_column.replace({'meow':'woof'}),
    inputs='some_data_set',
    outputs='demeowed_data',
    name='remove_meowing'
)
The latter example will only return the Pandas series for `some_column`. Is there a way to change a single column but return the entire dataframe?
d
We don’t have a `literal()`. It’s been asked before, but not enough people demanded it, so it looks like the issue was closed: https://github.com/kedro-org/kedro/issues/526
👍 1
g
@datajoely Thank you for sending that link and summarizing the current state.
d
You can also use functools to partially apply a literal: https://waylonwalker.com/kedro-node/#using-a-partial-function
👍 1
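A minimal sketch of the partial-application idea, using only pandas and the standard library. The helper `replace_in_column` and the sample data are hypothetical names made up for illustration, not part of Kedro or pandas; the `update_wrapper` call is an assumption about wanting a readable function name on the partial.

```python
from functools import partial, update_wrapper

import pandas as pd


def replace_in_column(data: pd.DataFrame, column: str, mapping: dict) -> pd.DataFrame:
    """Hypothetical helper: replace values in one column, return the whole DataFrame."""
    return data.assign(**{column: data[column].replace(mapping)})


# Bind the literal arguments up front so only `data` is left to supply.
remove_meowing = partial(replace_in_column, column="some_column", mapping={"meow": "woof"})
# Copy __name__ and __doc__ onto the partial (assumption: useful for log messages).
update_wrapper(remove_meowing, replace_in_column)

df = pd.DataFrame({"some_column": ["meow", "purr"], "other": [1, 2]})
result = remove_meowing(df)
```

Something like `remove_meowing` could then be passed as `func=` in a `node(...)` call with a single dataset as input.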
l
@Galen Seilis Correct me if I'm wrong, but I think this is more of a Pandas question than a Kedro one. If so, at least I can add something about pandas :) In this particular example I believe you can also apply `.replace` to the whole DataFrame at once. You can pass dictionaries to specify which values should be replaced in which columns, so you can replace values in multiple columns at the same time.
node(
    func=lambda data: data.replace(to_replace={'some_column': 'meow'}, value={'some_column':'woof'}),
    inputs='some_data_set',
    outputs='demeowed_data',
    name='remove_meowing'
)
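To see the dictionary form outside of a node, here is a small sketch on a made-up DataFrame (the column names and values are assumptions for illustration). It shows the paired `to_replace`/`value` dictionaries across two columns, plus the nested-dictionary spelling, which should give the same result.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "some_column": ["meow", "purr"],
        "other_column": ["meow", "hiss"],
        "untouched": ["meow", "meow"],
    }
)

# Paired dicts: keys are column names, values are what to look for / what to put in.
paired = df.replace(
    to_replace={"some_column": "meow", "other_column": "hiss"},
    value={"some_column": "woof", "other_column": "bark"},
)

# Nested dict: {column: {old_value: new_value}}.
nested = df.replace(
    {"some_column": {"meow": "woof"}, "other_column": {"hiss": "bark"}}
)
```

In both cases the whole DataFrame comes back, and columns not named in the dictionaries are left alone.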
But let's assume you still want to use methods that only apply to a single column. You can always use `pd.DataFrame.assign()` to re-assign an existing column and still return the dataframe, using another lambda within the `.assign()`: `your_dataframe.assign(some_value=lambda d: do_something(d))`. In this case `d` is the entire dataframe that you are applying the `.assign` to. Use it like this in your node if you want:
node(
    func=lambda data: data.assign(your_column=lambda d: d.your_column.replace({'meow':'woof'})),
    inputs='some_data_set',
    outputs='demeowed_data',
    name='remove_meowing'
)
--edit: even easier might be this, dropping the lambda within `.assign()`:
node(
    func=lambda data: data.assign(your_column=data.your_column.replace({'meow':'woof'})),
    inputs='some_data_set',
    outputs='demeowed_data',
    name='remove_meowing'
)
👍 1
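A quick sketch comparing the two `.assign()` spellings on a made-up DataFrame (the sample data is an assumption for illustration). On a plain DataFrame they should be equivalent; the inner lambda mainly helps in a method chain, where it receives the intermediate result rather than the original variable.

```python
import pandas as pd

df = pd.DataFrame({"your_column": ["meow", "purr", "meow"], "other": [1, 2, 1]})

# Callable form: `d` is the DataFrame that .assign() is called on.
with_lambda = df.assign(your_column=lambda d: d.your_column.replace({"meow": "woof"}))

# Direct form: the replaced Series is computed from `df` before .assign() runs.
direct = df.assign(your_column=df.your_column.replace({"meow": "woof"}))

# In a chain, the lambda sees the de-duplicated frame, not the original `df`.
chained = df.drop_duplicates().assign(
    your_column=lambda d: d.your_column.replace({"meow": "woof"})
)
```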
g
@Lodewic van Twillert Thank you, the approach you've illustrated satisfies the constraint I am interested in! I had not noticed the `value` parameter of `pandas.DataFrame.replace`. I appreciate you pointing out that using it allows the entire dataframe to be returned.
👍 1
i
Kinda late here but I tend to use the df.assign functionality for what you're proposing, like @Lodewic van Twillert’s final edited suggestion.