Andrew Stewart
02/02/2023, 6:28 PM
awswrangler, and it did a pretty decent job!
import awswrangler as wr
import pandas as pd
from kedro.io import DataSetError, AbstractDataSet
class AthenaDataSet(AbstractDataSet):
    """Kedro dataset backed by Amazon Athena via awswrangler.

    Loads run a SQL query through ``wr.athena.read_sql_query``; saves write
    Parquet to S3 with ``wr.s3.to_parquet`` (optionally registering a Glue
    table when ``table`` is supplied).
    """

    def __init__(self, query, s3_output, database, region_name='us-west-2', table=None):
        """
        Args:
            query: SQL statement executed on load.
            s3_output: S3 prefix used for Athena query staging and as the
                destination of saved Parquet data.
            database: Glue/Athena database name.
            region_name: AWS region. NOTE(review): awswrangler calls accept a
                ``boto3_session`` rather than ``region_name``; this value is
                kept for interface compatibility and must agree with the
                ambient AWS configuration (e.g. AWS_DEFAULT_REGION) — confirm
                in deployment.
            table: Optional Glue table name to register saved data under.
        """
        self._query = query
        self._s3_output = s3_output
        self._database = database
        self._region_name = region_name
        self._table = table

    def _load(self):
        # BUG FIX: ``read_sql_query`` has no ``region_name`` parameter, so the
        # original call raised TypeError. The region comes from the default
        # boto3 session. Also route staging through the configured s3_output.
        return wr.athena.read_sql_query(
            self._query,
            database=self._database,
            s3_output=self._s3_output,
        )

    def _save(self, df: pd.DataFrame):
        # BUG FIX: awswrangler has no ``wr.athena.to_parquet``; Parquet writes
        # are done with ``wr.s3.to_parquet``. Glue-catalog registration needs a
        # table name, so it happens only when one was configured.
        if self._table:
            wr.s3.to_parquet(
                df=df,
                path=self._s3_output,
                dataset=True,
                database=self._database,
                table=self._table,
            )
        else:
            wr.s3.to_parquet(df=df, path=self._s3_output, dataset=True)

    def _exists(self):
        # NOTE(review): recent awswrangler releases name this
        # ``wr.s3.does_object_exist`` — verify against the pinned version.
        return wr.s3.exists(self._s3_output)

    def _describe(self):
        # Human-readable summary used by kedro logging/catalog display.
        return (
            f"AthenaDataSet: query={self._query}, "
            f"s3_output={self._s3_output}, database={self._database}"
        )
It even included a usage example:
# catalog.yml
...
datasets:
my_dataset:
type: my_module.AthenaDataSet
query: "SELECT * FROM my_table"
s3_output: "s3://my-bucket/my-data/my_table"
database: "my_database"
region_name: "us-west-2"
...
datajoely
02/02/2023, 6:51 PM
Andrew Stewart
02/02/2023, 7:00 PM
datajoely
02/02/2023, 7:00 PM
Andrew Stewart
02/02/2023, 10:51 PM
Ben Horsburgh
02/03/2023, 2:56 PM
Andrew Stewart
02/03/2023, 5:04 PM
import pyspark
from pyspark.sql import SparkSession
import awswrangler as wr
from kedro.io import AbstractDataSet, DataSetError
class AthenaAwswranglerDataSet(AbstractDataSet):
    """Write-only Kedro dataset: saves a PySpark DataFrame as Parquet on S3
    and registers it in the Glue catalog so Athena can query it.

    Loading and existence checks through PySpark are intentionally
    unsupported and raise ``DataSetError``.
    """

    def __init__(self, table: str, database: str = None, s3_staging_dir: str = None, spark_session: SparkSession = None, **kwargs):
        """
        Args:
            table: Glue/Athena table name; also used as the S3 subdirectory.
            database: Glue database the table is registered in.
            s3_staging_dir: Bucket/prefix (without scheme) the data lands in.
            spark_session: Existing SparkSession to reuse; a default one is
                obtained via ``getOrCreate`` when omitted.
            **kwargs: Forwarded to ``AbstractDataSet``.
        """
        self._table = table
        self._database = database
        self._s3_staging_dir = s3_staging_dir
        # Reuse the caller's session when given; otherwise get-or-create one.
        self._spark = spark_session or SparkSession.builder.getOrCreate()
        super().__init__(**kwargs)

    def _save(self, data: pyspark.sql.DataFrame) -> None:
        # Single source of truth for the destination path (was duplicated).
        path = f"s3a://{self._s3_staging_dir}/{self._table}"
        try:
            data.write.parquet(path)
            # NOTE(review): newer awswrangler releases require a
            # ``columns_types`` argument on create_parquet_table — confirm
            # the pinned version before relying on this call.
            wr.catalog.create_parquet_table(
                path=path,
                table=self._table,
                database=self._database,
                if_not_exists=True
            )
        except Exception as e:
            # BUG FIX: chain the original exception so the root cause
            # survives into the traceback.
            raise DataSetError(f"Failed to save data to Athena table `{self._table}`: {e}") from e

    def _load(self) -> pyspark.sql.DataFrame:
        raise DataSetError("Loading data from Athena using PySpark is not supported.")

    def _exists(self) -> bool:
        raise DataSetError("Checking if an Athena table exists using PySpark is not supported.")

    def _describe(self) -> dict:
        # BUG FIX: AbstractDataSet declares _describe abstract; without an
        # implementation the class cannot be instantiated by kedro.
        return {
            "table": self._table,
            "database": self._database,
            "s3_staging_dir": self._s3_staging_dir,
        }
datajoely
02/10/2023, 10:08 AM