Why Sidra is much more than a Data Lake | All the details

Have you heard about the data mesh paradigm? Here we explore the principles underpinning this approach and how Sidra Data Platform’s architecture leverages the promised benefits that these principles can bring to organizations.

Why Sidra is much more than a Data Lake

A few days ago, an interesting article was published about Data Mesh principles and logical architecture. This article builds on the idea that, while technology advances of the past decade have addressed the scale of volume of data and data processing, they have failed to address scale in other dimensions, like keeping up with the changes in the data landscape or the proliferation of sources of data and use cases. It suggests that some principles applied successfully to operational systems, like domain driven design or bounded contexts, may have been overlooked so far for big data platforms. As more data become available everywhere, it is more difficult to consume them all in one place under the control of one single platform and owner.

The article outlines an idea about data lakes and data warehouses as being centralized, monolithic and domain agnostic, thus failing to scale. The different failure modes are outlined in this article and include aspects like the inability to respond to new data sources or the disconnect between the data engineering team and the source teams. As such, the idea of a decentralized data mesh is introduced around four key principles:

Domain-oriented decentralized data ownership and architecture
Data as a product
Self-service data infrastructure as a platform
Federated computational governance

Based on these principles, the data mesh objective is to create a foundation for getting value from analytical data at scale. Sidra Data Platform is a unique proposition much beyond and different to the definition of a monolith data lake. Sidra allows a full end to end governed, modular, and fully scalable flow of data.

In this post, we explore how Sidra’s key architectural principles compare to the recently discussed principles exposed in the data mesh paradigm. We explore how these principles, or at least their assumed benefits, can be achieved through Sidra Data Platform while keeping the promises on streamlined management and governance.

Domain ownership

This principle concludes that the response to the traditional data silos of unreachable data is not solved by creating a single, centralized team who owns and curates the data from all the domains. Similar to the domain driven design principles, the distribution of responsibility on data should map to the teams closest to where the data is actually generated.

Sidra Data Platform is compatible with this principle, in the sense that the accelerators built for discovering and retrieving the data have been designed around an agnostic metadata hierarchy (Providers, Entities, Attributes), with the possibility to define different owners for each data provider. Also, Sidra supports several Data Storage Units (DSUs), where each DSU includes independent deployable components, including the storage and processing structural elements.

The actual data lake is then a collection of independent DSUs, which can help solve compliance, data location and separation constraints or just organizational requirements. The security and granular authorization model for Sidra allows integrating these capabilities with the organizational structure of the company.

Sidra Architecture DSU

On the side of business-specific data transformation and exploitation, the architectural principle of Sidra Client Applications is fully aligned with this principle as well. Sidra Client Applications are the independent sets of deployable services that transform and make use of the data stored in the DSU to serve a specific business case. The processing and persistence layer of Client Applications can be personalized to a myriad of scenarios, from BI classical analytics to experimentation sandboxes or knowledge mining tasks. Client Applications make use of their own copy of the data from the DSU (managed and synchronized transparently by the Sidra Core platform), its own code and metadata. All this emphasizes the principle of distributed data ownership and evolution.

Data as a product

This principle stems from the traditional high cost of discovering, trusting and using quality data. In a decentralized scenario, data as a product is presented as a way of packaging code, infra, data and metadata for each of the data domains. Handling data as a product means that there is need for domain data product owners, who are responsible for a set of success metrics around the usage of their data, as with any other product. Think about lead time for the usage of data, or data quality index, for example. The same applies to the product teams (engineers, analysts, data scientists), responsible for building and maintaining the data products.

In Sidra, we fully believe in having common infrastructure foundations, that are leveraged centrally in Sidra Core to provide common accelerators and services (preconfigured pipelines, security model, event system, management APIs, centralized monitoring, etc.). This allows having each Client Application embrace the possibility to have dedicated teams working on each domain (for example, each Client Application implementing one data domain business logic and serving one or multiple other domains), but just focus on the business-specifics. Each Client Application would represent a unit of code, infrastructure, and metadata, and at the same time benefit from centralized storage, common metadata and lineage framework, and common optimization rules (optimized storage).

This modular approach to Client Applications also makes cost attribution per domain easy. The common security authentication and authorization model of Sidra allows a decentralized, yet fully controlled management of data. This approach allows us to adapt fast to new use cases (e.g., from traditional BI to knowledge mining), including pure experimentation use cases, like Data Labs.

Sidra Client App

Additionally, Sidra Data Catalogue includes this required centralized piece of data discoverability, documentation, and lineage. Users and teams for each domain would be given permissions to access only certain DSUs or provider data if needed, thus respecting organizational boundaries around product data domains.

Self-serve data platform

Building, deploying and operating a data product requires specialized skills and infrastructure. The self-serve data infrastructure as a platform principle requires tooling that aids a domain data product developer to create, maintain and run data products as an abstraction, requiring less specialized knowledge. Thus, self-service lowers the barriers of usage and innovation.

One key dimension that has been growing with Sidra since its inception is precisely its self-service capabilities. The architecture of Sidra has been going through an evolutionary approach to facilitate the aspects of installing, setting up and creating new inbound (data connectors) and outbound (Client Applications) integrations.

Current self-service capabilities of Sidra can be organized in logical sets such as:

Capabilities for provisioning: Sidra installation and update CLI tool, pipelines auto-generation and orchestration, CI/CD pipelines and packages, infrastructure automation, etc.
Capabilities to create and build new use cases: common schema and metadata management capabilities, fully automated data orchestration within Core and from Core and Client Applications, metadata, and management APIs, build and release pipelines.
Capabilities to monitor and operate: Web interface including Data Catalogue and Authorization management, centralization of logs, centralized Application Insights, common Datawarehouse with operational insights and common repository of modules and services.

On top of this, Sidra’s upcoming releases will represent an even bigger qualitative change, with a more streamlined and user-friendly process for creating new data connectors and Client Applications from the web interface.

Federated computational governance

Finally, the fourth principle refers to the governance model that embraces decentralization and domain self-sovereignty, while it keeps being interoperable through global standardization. This standardization allows adhering to a set of global rules, that are applied to all data products and their interfaces. This principle, of course, relies not only on architecture but on a supportive organizational structure.

Making the parallelism with Sidra would require repeating some of the already mentioned points above, especially around the Client Applications paradigm and the central transversal capabilities and services, like the centralized catalogue and metadata, security model and rest of common services (e.g., ML service platform, Management UI, API management). One special mention here to the Integration Hub, which leverages Service Bus to enable messaging between Client Applications and Sidra Core or among Client Applications. Integration Hub allows a deeper synchronization and enables further composition of business use cases.

All the covered points demonstrate that Sidra is, beyond a Data Lake, a comprehensive Data Platform that is modular and future-proof.

We hope you enjoyed this article. If you have any questions on the points above, feel free to reach out to us directly at sidra@plainconcepts.com

Cookie	Duration	Description
__cfduid	1 year	The cookie is used by cdn services like CloudFare to identify individual clients behind a shared IP address and apply security settings on a per-client basis. It does not correspond to any user ID in the web application and does not store any personally identifiable information.
__cfduid	29 days 23 hours 59 minutes	The cookie is used by cdn services like CloudFare to identify individual clients behind a shared IP address and apply security settings on a per-client basis. It does not correspond to any user ID in the web application and does not store any personally identifiable information.
__cfduid	1 year	The cookie is used by cdn services like CloudFare to identify individual clients behind a shared IP address and apply security settings on a per-client basis. It does not correspond to any user ID in the web application and does not store any personally identifiable information.
__cfduid	29 days 23 hours 59 minutes	The cookie is used by cdn services like CloudFare to identify individual clients behind a shared IP address and apply security settings on a per-client basis. It does not correspond to any user ID in the web application and does not store any personally identifiable information.
_ga	1 year	This cookie is installed by Google Analytics. The cookie is used to calculate visitor, session, campaign data and keep track of site usage for the site's analytics report. The cookies store information anonymously and assign a randomly generated number to identify unique visitors.
_ga	1 year	This cookie is installed by Google Analytics. The cookie is used to calculate visitor, session, campaign data and keep track of site usage for the site's analytics report. The cookies store information anonymously and assign a randomly generated number to identify unique visitors.
_ga	1 year	This cookie is installed by Google Analytics. The cookie is used to calculate visitor, session, campaign data and keep track of site usage for the site's analytics report. The cookies store information anonymously and assign a randomly generated number to identify unique visitors.
_ga	1 year	This cookie is installed by Google Analytics. The cookie is used to calculate visitor, session, campaign data and keep track of site usage for the site's analytics report. The cookies store information anonymously and assign a randomly generated number to identify unique visitors.
_gat_UA-326213-2	1 year	No description
_gat_UA-326213-2	1 year	No description
_gat_UA-326213-2	1 year	No description
_gat_UA-326213-2	1 year	No description
_gid	1 year	This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the wbsite is doing. The data collected including the number visitors, the source where they have come from, and the pages viisted in an anonymous form.
_gid	1 year	This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the wbsite is doing. The data collected including the number visitors, the source where they have come from, and the pages viisted in an anonymous form.
_gid	1 year	This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the wbsite is doing. The data collected including the number visitors, the source where they have come from, and the pages viisted in an anonymous form.
_gid	1 year	This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the wbsite is doing. The data collected including the number visitors, the source where they have come from, and the pages viisted in an anonymous form.
attributionCookie	session	No description
cookielawinfo-checkbox-analytics	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Analytics" category .
cookielawinfo-checkbox-necessary	1 year	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-necessary	1 year	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-non-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Non Necessary".
cookielawinfo-checkbox-non-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Non Necessary".
cookielawinfo-checkbox-non-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Non Necessary".
cookielawinfo-checkbox-non-necessary	1 year	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Non Necessary".
cookielawinfo-checkbox-performance	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to store the user consent for cookies in the category "Performance".
cppro-ft	1 year	No description
cppro-ft	7 years 1 months 12 days 23 hours 59 minutes	No description
cppro-ft	7 years 1 months 12 days 23 hours 59 minutes	No description
cppro-ft	1 year	No description
cppro-ft-style	1 year	No description
cppro-ft-style	1 year	No description
cppro-ft-style	session	No description
cppro-ft-style	session	No description
cppro-ft-style-temp	23 hours 59 minutes	No description
cppro-ft-style-temp	23 hours 59 minutes	No description
cppro-ft-style-temp	23 hours 59 minutes	No description
cppro-ft-style-temp	1 year	No description
i18n	10 years	No description available.
IE-jwt	62 years 6 months 9 days 9 hours	No description
IE-LANG_CODE	62 years 6 months 9 days 9 hours	No description
IE-set_country	62 years 6 months 9 days 9 hours	No description
JSESSIONID	session	The JSESSIONID cookie is used by New Relic to store a session identifier so that New Relic can monitor session counts for an application.
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
viewed_cookie_policy	1 year	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
viewed_cookie_policy	1 year	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
wmc	9 years 11 months 30 days 11 hours 59 minutes	No description

Cookie	Duration	Description
__cf_bm	30 minutes	This cookie, set by Cloudflare, is used to support Cloudflare Bot Management.
sp_landing	1 day	The sp_landing is set by Spotify to implement audio content from Spotify on the website and also registers information on user interaction related to the audio content.
sp_t	1 year	The sp_t cookie is set by Spotify to implement audio content from Spotify on the website and also registers information on user interaction related to the audio content.

Cookie	Duration	Description
_hjAbsoluteSessionInProgress	1 year	No description
_hjAbsoluteSessionInProgress	1 year	No description
_hjAbsoluteSessionInProgress	1 year	No description
_hjAbsoluteSessionInProgress	1 year	No description
_hjFirstSeen	29 minutes	No description
_hjFirstSeen	29 minutes	No description
_hjFirstSeen	29 minutes	No description
_hjFirstSeen	1 year	No description
_hjid	11 months 29 days 23 hours 59 minutes	This cookie is set by Hotjar. This cookie is set when the customer first lands on a page with the Hotjar script. It is used to persist the random user ID, unique to that site on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.
_hjid	11 months 29 days 23 hours 59 minutes	This cookie is set by Hotjar. This cookie is set when the customer first lands on a page with the Hotjar script. It is used to persist the random user ID, unique to that site on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.
_hjid	1 year	This cookie is set by Hotjar. This cookie is set when the customer first lands on a page with the Hotjar script. It is used to persist the random user ID, unique to that site on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.
_hjid	1 year	This cookie is set by Hotjar. This cookie is set when the customer first lands on a page with the Hotjar script. It is used to persist the random user ID, unique to that site on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.
_hjIncludedInPageviewSample	1 year	No description
_hjIncludedInPageviewSample	1 year	No description
_hjIncludedInPageviewSample	1 year	No description
_hjIncludedInPageviewSample	1 year	No description
_hjSession_1776154	session	No description
_hjSessionUser_1776154	session	No description
_hjTLDTest	1 year	No description
_hjTLDTest	1 year	No description
_hjTLDTest	session	No description
_hjTLDTest	session	No description
_lfa_test_cookie_stored	past	No description

Cookie	Duration	Description
loglevel	never	No description available.
prism_90878714	1 month	No description
redirectFacebook	2 minutes	No description
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt.innertube::nextId	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	This cookie, set by YouTube, registers a unique ID to store data on what videos from YouTube the user has seen.