Hello slightly smiling face anyone that has already implemen Kedro #questions

Toni - TomTom - Madrid

07/10/2023, 9:28 AM

Hello 🙂 Has anyone already implemented a Kedro dataset type to read text files as an RDD in Spark? (Extra points if you have done it for XML or KML files 😉.) If not, I would like to know the best (simplest) way to implement this class from the existing methods/templates in Kedro. Thanks a lot in advance!

Nok Lam Chan

07/10/2023, 9:58 AM

Can you try passing the format into the load_args?

def load(self, path=None, format=None, schema=None, **options):
    """Loads data from a data source and returns it as a :class:`DataFrame`.

    .. versionadded:: 1.4.0

    Parameters
    ----------
    path : str or list, optional
        optional string or a list of string for file-system backed data sources.
    format : str, optional
        optional string for format of the data source. Default to 'parquet'.

This is an excerpt from the Spark documentation. We use DataFrameReader under the hood, so whatever format Spark supports should work out of the box.
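To make the suggestion concrete, here is a minimal sketch of what passing a format to DataFrameReader.load looks like in plain PySpark. The helper names load_with_format and text_lines_as_rdd are hypothetical, and the sketch assumes a SparkSession is passed in as spark. It relies on Spark's "text" source, which yields a DataFrame with a single string column named value.

```python
def load_with_format(spark, path, file_format, **options):
    """Forward a format name and reader options to DataFrameReader.load."""
    return spark.read.load(path, format=file_format, **options)


def text_lines_as_rdd(spark, path):
    """Read text files through the DataFrame API, then drop down to an RDD of lines."""
    df = load_with_format(spark, path, "text")
    # Each row of the "text" source has one column, `value`, holding one line.
    return df.rdd.map(lambda row: row.value)
```

This gives one record per line rather than one record per file, so it may not suit multi-line XML or KML documents.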

Nok Lam Chan

07/10/2023, 9:58 AM

https://docs.kedro.org/en/stable/kedro.extras.datasets.spark.SparkDataSet.html

Toni - TomTom - Madrid

07/11/2023, 3:30 PM

Thanks a lot @Nok Lam Chan! The problem is that RDDs are not part of Spark DataFrames; they use SparkContext rather than spark.sql, for example. They are more primitive 😛, but I cannot believe that I am the first one trying to read RDDs in Kedro!

Nok Lam Chan

07/11/2023, 7:32 PM

Hopefully someone can chime in. I can have a look when I come back next week. I cannot think of anything off the top of my head now 😛 I am sure someone has tried. Even if there is no existing implementation, I would suggest looking at the SparkDataset implementation. It shouldn't be too difficult to write your custom dataset; much of the code is just there to handle the path and make sure it works on different storage systems and Databricks. If you are interested in making a PR to add this to the datasets, I am happy to help.
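A custom dataset along these lines could look roughly like the sketch below. It is not an existing Kedro class: the name WholeTextFilesRDDDataSet, its constructor arguments, and the read-only behaviour are all assumptions made for illustration. It also assumes the Kedro 0.18.x base class name AbstractDataSet, and falls back to a plain object so the sketch can be read and run without Kedro installed.

```python
from typing import Any, Dict, Optional

try:
    # Assumption: Kedro 0.18.x, where the base class is named AbstractDataSet.
    from kedro.io import AbstractDataSet
except ImportError:
    # Fallback so the sketch still runs without Kedro; not a real dataset then.
    AbstractDataSet = object


class WholeTextFilesRDDDataSet(AbstractDataSet):
    """Hypothetical dataset: loads a directory of text files as an RDD of
    (file path, file content) pairs via SparkContext.wholeTextFiles."""

    def __init__(
        self,
        filepath: str,
        min_partitions: Optional[int] = None,
        spark_context: Any = None,
    ):
        self._filepath = filepath
        self._min_partitions = min_partitions
        # Optional injection point, mainly useful for testing.
        self._spark_context = spark_context

    def _get_spark_context(self):
        if self._spark_context is not None:
            return self._spark_context
        # Imported lazily so the module can be imported without PySpark.
        from pyspark.sql import SparkSession

        return SparkSession.builder.getOrCreate().sparkContext

    def _load(self):
        sc = self._get_spark_context()
        return sc.wholeTextFiles(self._filepath, minPartitions=self._min_partitions)

    def _save(self, data) -> None:
        # Saving is deliberately left out of this sketch.
        raise NotImplementedError("This sketch is read-only.")

    def _describe(self) -> Dict[str, Any]:
        return {"filepath": self._filepath, "min_partitions": self._min_partitions}
```

Unlike SparkDataset, this sketch does no path handling for different storage systems or Databricks, which is the part described above as most of the real work.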

Nok Lam Chan

07/11/2023, 7:33 PM

How would you load it with just pure spark? Can you show me a snippet if possible?

Toni - TomTom - Madrid

07/21/2023, 8:25 AM

Hi Nok! It would be quite simple:

>> textFiles = sc.wholeTextFiles(dirPath)

My point is that Kedro lacks RDD capabilities 😅. Generally speaking, this should be a must when working with Spark (not all big data is structured data). KML or XML files are an example of this. Thanks a lot for your help! 🤗
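As an illustration of the KML use case, the sketch below parses each whole file returned by wholeTextFiles with Python's standard XML parser. The function names placemark_names and names_per_file are hypothetical, sc is assumed to be an existing SparkContext, and the files are assumed to use the KML 2.2 namespace.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: the files use the standard KML 2.2 namespace.
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def placemark_names(kml_text: str) -> List[str]:
    """Return the <name> of every <Placemark> in one KML document."""
    root = ET.fromstring(kml_text)
    return [el.text or "" for el in root.findall(".//kml:Placemark/kml:name", KML_NS)]


def names_per_file(sc, dir_path: str):
    """Build an RDD of (file path, list of placemark names) pairs."""
    # wholeTextFiles yields (path, content) pairs, one per file, so each
    # document is parsed as a whole rather than line by line.
    return sc.wholeTextFiles(dir_path).mapValues(placemark_names)
```

Because every file is read into memory as a single string, this approach fits many small or medium files better than a few very large ones.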
