Hi everyone, what kind of dataset should I create if I'm scraping data from web pages or grabbing data from RSS feeds? I have a small project where I need to pull data from a few websites regularly. They are mostly Microsoft notices for various products/services. I want to store the text in a MongoDB Atlas database I have set up. I looked through the documentation but the only relevant reference I found was for an HTTP(S) API call. Any guidance greatly appreciated 🙂
Maybe I can use the API dataset. What I'm confused about is where to put the scraping logic: in a custom load() function within a custom dataset class, or in a pipeline that handles scraping the data and then storing it. To me it makes more sense to put the logic in a pipeline. Any wisdom to guide my next steps is greatly appreciated! Thanks!
07/03/2023, 8:46 AM
So we typically try to decouple IO from business logic, which is the pattern you suggest: put the logic in the pipeline. I would see how you get on with the native APIDataSet, and for any repeated processing logic you can use the Modular Pipeline pattern to minimise duplicated code.
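To illustrate the decoupling idea above, here is a minimal sketch of what the "business logic" half might look like as a plain Python function, using only the standard library. The assumption is that the raw HTML arrives via a catalog dataset (e.g. APIDataSet) and the result is saved by another dataset (e.g. one backed by MongoDB), so the function itself does no IO. The function and variable names here are hypothetical, not from the question.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the visible text fragments from an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_notice_text(html: str) -> str:
    """Pure transformation: raw HTML in, plain text out.

    No requests, no database calls here — loading the page and saving
    the result are handled by the data catalog, keeping this function
    trivially testable.
    """
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

In a Kedro pipeline this function would be wrapped in a node, with the dataset names (hypothetical here) pointing at catalog entries, e.g. `node(extract_notice_text, inputs="raw_notice_page", outputs="notice_text")` — the scraping/storage IO stays in the catalog, and the node holds only the parsing logic.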