Data Architecture
Seraphnet's V1 uses a ready-made API that provides access to global data; this raw data is then processed by the data pipeline. In addition, Seraphnet leverages existing web crawling frameworks to build web crawlers that fetch data from various news sources such as Google News, Bing News, BBC, CNN, Yahoo, MSN, The New York Times, and The Washington Post. The raw data coming from these sources supports prompt engineering for the LLM integration.
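As a rough illustration of what such a crawler might look like, the sketch below uses Scrapy, one commonly used web crawling framework. The framework choice, start URL, CSS selectors, and source name are all assumptions for illustration, not details of Seraphnet's actual crawlers.

```python
# Hypothetical sketch: a minimal Scrapy spider for a single news source.
# Selectors and URLs are placeholders; each real source needs its own rules.
import scrapy


class HeadlineSpider(scrapy.Spider):
    name = "headlines"
    # Placeholder start URL; real crawlers would target each configured source.
    start_urls = ["https://news.example.com/world"]

    def parse(self, response):
        # Extract a title and link per article block on the page.
        for article in response.css("article"):
            yield {
                "title": article.css("h2::text").get(),
                "url": response.urljoin(article.css("a::attr(href)").get() or ""),
                "source": "example-news",
            }
```

A spider like this could be run with `scrapy runspider headline_spider.py -o headlines.json`, and its output handed to the ingestion stage described below.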
Seraphnet's data sources are designed to be flexible and extensible, allowing for the integration of new information sources as the ecosystem evolves. This plurality of sources is key to ensuring the infrastructure remains ideologically neutral.
Seraphnet utilizes the Dremio Data Lakehouse architecture, built on top of Apache Iceberg, to distribute the data. The Data Lakehouse is created with the Kedro data engineering framework, a toolbox for production-ready data science.
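For a sense of how downstream steps might read from the Iceberg tables underpinning the lakehouse, here is a minimal sketch using PyIceberg. The catalog name, namespace, and table name are placeholders; the actual Dremio/Iceberg configuration is not specified in this document.

```python
# Minimal sketch of reading from the lakehouse's Iceberg tables with PyIceberg.
# Catalog, namespace, and table names are placeholders for illustration.
from pyiceberg.catalog import load_catalog

# Assumes a catalog named "lakehouse" is configured (e.g. in ~/.pyiceberg.yaml).
catalog = load_catalog("lakehouse")
table = catalog.load_table("news.articles")

# Materialize a small slice of the table for downstream pipeline steps.
articles_df = table.scan(limit=1_000).to_pandas()
print(articles_df.head())
```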
Through its implementation of Kedro, Seraphnet's data pipeline follows a structured and modular approach. The information gathered from various sources is first processed through the pipeline store, where it undergoes data ingestion, cleaning, and preprocessing steps.
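The sketch below shows how these stages could be expressed as a Kedro pipeline. The node functions, dataset names, and column names are assumptions made for illustration, not Seraphnet's actual codebase.

```python
# Illustrative Kedro pipeline for the ingestion, cleaning, and preprocessing
# stages. Dataset and column names are placeholders.
import pandas as pd
from kedro.pipeline import Pipeline, node


def ingest_articles(raw_articles: pd.DataFrame) -> pd.DataFrame:
    # Drop exact duplicates fetched from overlapping sources.
    return raw_articles.drop_duplicates(subset=["url"])


def clean_articles(articles: pd.DataFrame) -> pd.DataFrame:
    # Remove rows with missing text and normalize whitespace.
    articles = articles.dropna(subset=["title", "body"])
    articles["body"] = articles["body"].str.strip()
    return articles


def preprocess_articles(articles: pd.DataFrame) -> pd.DataFrame:
    # Lowercase text as a simple example of a preprocessing step.
    articles["body_clean"] = articles["body"].str.lower()
    return articles


def create_pipeline() -> Pipeline:
    return Pipeline(
        [
            node(ingest_articles, "raw_articles", "ingested_articles"),
            node(clean_articles, "ingested_articles", "clean_articles"),
            node(preprocess_articles, "clean_articles", "preprocessed_articles"),
        ]
    )
```

Each node maps a named input dataset to a named output dataset, which is what gives the pipeline its modular, inspectable structure.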
Once the data is ingested and preprocessed, it enters the pipeline analysis stage, where techniques such as natural language processing, information retrieval, and data mining are applied, depending on the specific requirements of the Clearpills and the user's queries.
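As one concrete example of the information-retrieval side of this stage, the sketch below ranks preprocessed articles against a user query with TF-IDF similarity. The library choice (scikit-learn) and the shape of the inputs are assumptions for illustration.

```python
# Minimal information-retrieval sketch: rank article texts against a query
# using TF-IDF and cosine similarity. Inputs and parameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_articles(query: str, article_texts: list[str], top_k: int = 5):
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(article_texts)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    # Return (index, score) pairs for the most relevant articles.
    ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]
```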
The processed and analyzed data is then fed into the LLM, which leverages the curated and enriched information to generate relevant and insightful responses.
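A simple way to picture this final step is a prompt-assembly function that packs the top-ranked articles into the context given to the LLM. The template and field names below are hypothetical and not tied to any specific model provider's API.

```python
# Hypothetical prompt-assembly step: pack the top-ranked, enriched articles
# into the context handed to the LLM. Template and fields are illustrative.
def build_prompt(query: str, articles: list[dict], max_articles: int = 5) -> str:
    context = "\n\n".join(
        f"[{a['source']}] {a['title']}\n{a['body_clean'][:1000]}"
        for a in articles[:max_articles]
    )
    return (
        "Answer the question using only the sources below and cite them.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```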