# random
a
So I asked ChatGPT to write a custom kedro dataset class for Athena that uses awswrangler, and it did a pretty decent job!
import awswrangler as wr
import pandas as pd

from kedro.io import DataSetError, AbstractDataSet

class AthenaDataSet(AbstractDataSet):
    def __init__(self, query, s3_output, database, region_name='us-west-2'):
        self._query = query
        self._s3_output = s3_output
        self._database = database
        self._region_name = region_name

    def _load(self):
        df = wr.athena.read_sql_query(self._query, database=self._database, region_name=self._region_name)
        return df

    def _save(self, df: pd.DataFrame):
        wr.athena.to_parquet(
            df=df,
            path=self._s3_output,
            database=self._database,
            region_name=self._region_name
        )

    def _exists(self):
        return wr.s3.exists(self._s3_output)

    def _describe(self):
        return f"AthenaDataSet: query={self._query}, s3_output={self._s3_output}, database={self._database}"
It even included a usage example:
# catalog.yml
...
datasets:
  my_dataset:
    type: my_module.AthenaDataSet
    query: "SELECT * FROM my_table"
    s3_output: "<s3://my-bucket/my-data/my_table>"
    database: "my_database"
    region_name: "us-west-2"
...
😮 3
🎉 11
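Side note for anyone following along: once that entry is in catalog.yml, a pipeline node just refers to it by name and kedro calls _load behind the scenes. A minimal sketch (the function and output dataset names here are made up for illustration):

import pandas as pd

from kedro.pipeline import node, pipeline


def summarise(df: pd.DataFrame) -> pd.DataFrame:
    # df arrives as the result of AthenaDataSet._load for "my_dataset"
    return df.describe()


# "my_dataset_summary" has no catalog entry, so kedro keeps it in memory by default
example_pipeline = pipeline([
    node(summarise, inputs="my_dataset", outputs="my_dataset_summary"),
])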
d
That’s pretty cool
a
Yeah but then I was like "Hey ChatGPT, finish the rest of my project for me while I go get lunch" and it was like "Pfff, show me the money pal!"
🤣 1
Kinda cheeky
d
😂
Interestingly, when the team here were playing with it, it introduced syntax to kedro which wasn’t valid. It’s a super interesting space and it will only get better; it’s just concerning that when it’s wrong, it’s really plausibly correct
😨 1
I’m also disappointed it didn’t do typehints and docstrings 😛
a
Oh I did another iteration and it included both
K 1
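For anyone curious, a typed-and-documented version looks something like this (a hypothetical sketch rather than the verbatim ChatGPT output; I've used wr.s3.to_parquet as the writer instead of the wr.athena call in the first draft):

import awswrangler as wr
import pandas as pd

from kedro.io import AbstractDataSet


class AthenaDataSet(AbstractDataSet):
    """Reads Athena query results and writes Parquet back to S3 via awswrangler."""

    def __init__(self, query: str, s3_output: str, database: str, region_name: str = "us-west-2") -> None:
        """Store the query, S3 output path, database and region for later use."""
        self._query = query
        self._s3_output = s3_output
        self._database = database
        # awswrangler resolves the region through boto3; kept here only to mirror the original
        self._region_name = region_name

    def _load(self) -> pd.DataFrame:
        """Run the query against Athena and return the result as a pandas DataFrame."""
        return wr.athena.read_sql_query(self._query, database=self._database)

    def _save(self, data: pd.DataFrame) -> None:
        """Write the DataFrame to the configured S3 path as Parquet."""
        wr.s3.to_parquet(df=data, path=self._s3_output, dataset=True)

    def _describe(self) -> dict:
        """Return the parameters that identify this dataset instance."""
        return {"query": self._query, "s3_output": self._s3_output, "database": self._database}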
b
The wrongness might be a great user discovery tool - it's wrong based on stats models, which means it has observed many similar instances in its training. Trick is discerning if that is many great examples, or nonsense!
a
also important to remember the LM is trained on data only up until '21
and software projects move fast
@datajoely I gave ChatGPT a slightly harder challenge. Does this pass the eye test?
import pyspark
from pyspark.sql import SparkSession
import awswrangler as wr
from kedro.io import AbstractDataSet, DataSetError

class AthenaAwswranglerDataSet(AbstractDataSet):
    def __init__(self, table: str, database: str = None, s3_staging_dir: str = None, spark_session: SparkSession = None, **kwargs):
        self._table = table
        self._database = database
        self._s3_staging_dir = s3_staging_dir
        self._spark = spark_session or SparkSession.builder.getOrCreate()
        super().__init__(**kwargs)

    def _save(self, data: pyspark.sql.DataFrame) -> None:
        try:
            data.write.parquet(f"s3a://{self._s3_staging_dir}/{self._table}")
            wr.catalog.create_parquet_table(
                path=f"s3a://{self._s3_staging_dir}/{self._table}",
                table=self._table,
                database=self._database,
                if_not_exists=True
            )
        except Exception as e:
            raise DataSetError(f"Failed to save data to Athena table `{self._table}`: {e}")

    def _load(self) -> pyspark.sql.DataFrame:
        raise DataSetError("Loading data from Athena using PySpark is not supported.")

    def _exists(self) -> bool:
        raise DataSetError("Checking if an Athena table exists using PySpark is not supported.")
I think my only question is with the handling of the SparkSession; trying to compare w/ how https://kedro.readthedocs.io/en/stable/_modules/kedro/extras/datasets/spark/spark_dataset.html#SparkDataSet does it
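For comparison, the pattern in that linked source (paraphrasing from memory, so treat this as a sketch) is that no SparkSession is accepted or stored in __init__ at all; there's just a small static helper that reads go through:

from pyspark.sql import DataFrame, SparkSession

from kedro.io import AbstractDataSet


class SparkDataSetSketch(AbstractDataSet):
    """Stripped-down illustration of how kedro's SparkDataSet handles the session."""

    def __init__(self, filepath: str) -> None:
        # note: no spark_session parameter here at all
        self._filepath = filepath

    @staticmethod
    def _get_spark() -> SparkSession:
        # looked up lazily at load time, so the dataset never owns a session
        return SparkSession.builder.getOrCreate()

    def _load(self) -> DataFrame:
        return self._get_spark().read.parquet(self._filepath)

    def _save(self, data: DataFrame) -> None:
        data.write.parquet(self._filepath)

    def _describe(self) -> dict:
        return {"filepath": self._filepath}

So by that comparison, the spark_session argument in the ChatGPT version could probably go, with the session resolved lazily rather than stored on the instance.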
d
the load method is amusing but it’s functional!