# questions
s
Hi all. I'm trying to use Kedro to develop an ML pipeline with Spark using a Dataproc cluster in GCP. I'd like to load a table from BigQuery into a Spark dataset; how could I define that in the catalog? I know that I can use "plain" PySpark to read the table, but I'd like to use the catalog. Thanks!
d
I haven't done this personally, but if you use https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example, then you should just be able to specify `file_format: bigquery` for your `spark.SparkDataSet`.
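Something like this in your `catalog.yml`, I'd guess (untested on my side; the project/dataset/table names below are placeholders):
```yaml
my_bq_table:
  type: spark.SparkDataSet
  filepath: my_gcp_project.my_bq_dataset.my_table   # placeholder BigQuery table reference
  file_format: bigquery
```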
b
@Sebastian Cardona Lozano Hi Sebastian... I am facing errors while running Kedro in Dataproc. Can you please share the spark.yml configuration to be used for Dataproc? The documentation says "You should modify this code to adapt it to your cluster's setup, e.g. setting master to yarn"... any help on this is much appreciated.
m
Maybe this repo will help you: https://github.com/getindata/kedro-pyspark-dataproc-demo. It's a demo of the Iris Kedro pipeline running on Dataproc Batches / Serverless Spark.
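Not sure what your cluster looks like, but as a rough idea, a `spark.yml` for a YARN-backed Dataproc cluster might look something like this (the `spark.master: yarn` part is what the docs mention; the jar version and memory values are only illustrative):
```yaml
# conf/base/spark.yml -- illustrative values only, adapt to your cluster
spark.master: yarn
spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.26.0.jar
spark.driver.memory: 4g
spark.executor.memory: 8g
```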
๐Ÿ‘ 1
s
Hi all. I'm trying to develop the model pipeline using JupyterLab on the Dataproc cluster (it is a small cluster). I followed the instructions in the documentation and created these new files: 1. `spark.yml`, specifying the .jar with the spark-bigquery connector:
```yaml
spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.26.0.jar
```
2. In `src/<package_name>/`, a `hooks.py`:
```python
from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialises a SparkSession using the config
        defined in the project's conf folder.
        """

        # Load the spark configuration in spark.yml using the config loader
        parameters = context.config_loader.get("spark*", "spark*/**")
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session
        spark_session_conf = (
            SparkSession.builder
            .appName(context._package_name)
            .config(conf=spark_conf)
        )
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
```
3. Updated the `HOOKS` variable in `src/<package_name>/settings.py` as follows:
```python
from <package_name>.hooks import SparkHooks

HOOKS = (SparkHooks(),)
```
4. In the `catalog.yml` I specified the BigQuery table as follows:
```yaml
master_table:
    type: spark.SparkDataSet
    filepath: gcp_project_name.bigquery_dataset.table_name_in_big_query
    file_format: bigquery
```
When I load the table from the catalog in a Jupyter notebook, I get this:
```
df = catalog.load("master_table")

[02/25/23 03:09:47] INFO  Loading data from 'master_table' (SparkDataSet)...  data_catalog.py:343

Traceback (most recent call last):
  /opt/conda/miniconda3/envs/py39_599/lib/python3.9/site-packages/kedro/io/core.py:186 in load
      return self._load()
  /opt/conda/miniconda3/envs/py39_599/lib/python3.9/site-packages/kedro/extras/datasets/spark/spark_dataset.py:392 in _load
      return read_obj.load(load_path, self._file_format, **self._load_args)
  /opt/conda/miniconda3/envs/py39_599/lib/python3.9/site-packages/pyspark/sql/readwriter.py:177 in load
      return self._df(self._jreader.load(path))
  /opt/conda/miniconda3/envs/py39_599/lib/python3.9/site-packages/py4j/java_gateway.py:1321 in __call__
      return_value = get_return_value(
          answer, self.gateway_client, self.target_id, self.name)
  /opt/conda/miniconda3/envs/py39_599/lib/python3.9/site-packages/pyspark/sql/utils.py:190 in deco
      return f(*a, **kw)
  /opt/conda/miniconda3/envs/py39_599/lib/python3.9/site-packages/py4j/protocol.py:326 in get_return_value
      raise Py4JJavaError(
          "An error occurred while calling {0}{1}{2}.\n".
          format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o243.load.
: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Malformed project
resource name: projects//home/bbog-gd-599-profundizacion-ml/bdb-gcp-pr-ac-ba; Expected: projects/<project_id>
```
Nevertheless, if I use the manual method to initialise the Spark session and load the data, it works:
```python
# Initialize the SparkSession.
VER = "0.26.0"
FILE_NAME = f"spark-bigquery-with-dependencies_2.12-{VER}.jar"
connector = f"gs://spark-lib/bigquery/{FILE_NAME}"

spark = (
    SparkSession.builder.appName("599-produndizacion")
    .config("spark.jars", connector)
    .getOrCreate()
)

# Load data
df = (
    spark.read.format("bigquery")
    .option("table", "gcp_project_name.bigquery_dataset.table_name_in_big_query")
    .load()
)
```
Thanks for your help and sorry for the long comment 🙂