The cost of cleaning up data is often beyond the comfort zone of businesses full of potentially dirty data. This paves the way for reliable and compliant corporate data flows.
According to Kyle Kirwan, co-founder and CEO of data observability platform BigEye, few companies have the resources needed to develop tools for challenges such as large-scale data observability. As a result, many companies are essentially going blind, reacting when something goes wrong instead of continually addressing data quality.
A data trust provides a legal framework for the management of shared data. It promotes cooperation through common rules for data protection, confidentiality and confidentiality; and enables organizations to securely connect their data sources to a shared repository of data.
Bigeye brings together data engineers, analysts, scientists and stakeholders to build trust in data. Its platform helps companies create SLAs for monitoring and anomaly detection and ensuring data quality and reliable pipelines.
With full API access, a user-friendly interface, and automated yet flexible customization, data teams can monitor quality, consistently detect and resolve issues, and ensure that each be able to rely on user data.
uber data experience
Two early members of the data team at Uber — Kirvan and bigeye co-founder and CTO Egor Gryznov — set out to use what they learned to build Uber’s scale to build easy-to-deploy SaaS tools for data engineers. prepared for.
Kiran was one of Uber’s first data scientists and the first metadata product manager. Gryaznov was a staff-level engineer who managed Uber’s Vertica data warehouse and developed a number of internal data engineering tools and frameworks.
He realized that his team was building tools to manage Uber’s vast data lake and the thousands of internal data users available to most data engineering teams.
Automatically monitoring and detecting reliability issues within thousands of tables in a data warehouse is no easy task. Companies like Instacart, Udacity, Docker, and Clubhouse use Bigeye to make their analysis and machine learning work consistently.
a growing area
Founding Bigeye in 2019, he recognized the growing problem of enterprises deploying data in operations workflows, machine learning-powered products and services, and high-ROI use cases such as strategic analysis and business intelligence-driven decision-making.
The data observability space saw several entrants in 2021. Bigeye differentiates itself from that pack by giving users the ability to automatically assess customer data quality with over 70 unique data quality metrics.
These metrics are trained with thousands of different anomaly detection models to ensure data quality problems – even the most difficult to detect – are ahead of data engineers ever. Do not increase
Last year, data observability burst onto the scene, with at least ten data observability startups announcing significant funding rounds.
Kirwan predicted that this year, data observation will become a priority for data teams as they seek to balance the demand for managing complex platforms with the need to ensure data quality and pipeline reliability.
Bigeye’s data platform is no longer in beta. Some enterprise-grade features are still on the roadmap, such as full role-based access control. But others, such as SSO and in-VPC deployment, are available today.
The app is closed source, and hence proprietary models are used for anomaly detection. Bigeye is a big fan of open-source alternatives, but decided to develop one on its own to achieve internally set performance goals.
Machine learning is used in a few key places to bring a unique mix of metrics to each table in a customer’s connected data sources. Anomaly detection models are trained on each of those metrics to detect abnormal behavior.
Built-in three features in late 2021 automatically detect and alert data quality issues and enable data quality SLAs.
The first, deltas, makes it easy to compare and validate multiple versions of any dataset.
Issues, second, brings together multiple alerts at the same time with valuable context about related issues. This makes it easier to document past improvements and speed up proposals.
Third, the dashboard provides a holistic view of the health of the data, helps identify data quality hotspots, close gaps in monitoring coverage, and measures a team’s improvement in reliability.
eyeball data warehouse
TechNewsWorld spoke with Kirwan to uncover some of the complexities of his company’s data sniffing platform, which provides data scientists.
TechNewsWorld: What makes Bigeye’s approach innovative or cutting edge?
Kyle Kiran: Data observation requires a consistent and thorough knowledge of what is happening inside all the tables and pipelines in your data stack. It is similar to SRE [site reliability engineering] And DevOps teams use applications and infrastructure to work round the clock. But it has been repurposed for the world of data engineering and data science.
While data quality and data reliability have been an issue for decades, data applications are now important in how many major businesses run; Because any loss of data, outage, or degradation can quickly result in loss of revenue and customers.
Without data observability, data dealers must continually react to data quality issues and entanglements as they go about using the data. A better solution is to proactively identify the problems and fix the root causes.
How does trust affect data?
Ray: Often, problems are discovered by stakeholders such as executives who do not trust their often broken dashboards. Or users get confusing results from in-product machine learning models. Data engineers can better get ahead of problems and prevent business impact if they are alerted enough.
How does this concept differ from similar sounding technologies like Integrated Data Management?
Ray: Data observability is a core function within data operations (think: data management). Many customers look for best-of-breed solutions for each task within data operations. This is why technologies like Snowflake, FiveTran, Airflow and DBT are exploding in popularity. Each is considered an important part of the “modern data stack” rather than a one-size-fits-none solution.
Data Overview, Data SLA, ETL [extract, transform, load] Code version control, data pipeline testing, and other techniques must be used to keep modern data pipelines working smoothly. Just like how high-performance software engineers and DevOps teams use their collaborative technologies.
What role do data pipelines and dataops play with data visibility?
Ray: Data Observability is closely related to the emerging practice of DataOps and Data Reliability Engineering. DataOps refers to the broad set of operational challenges that data platform owners will face. Data Reliability Engineering is a part, but only part, of Data Ops, just as Site Reliability Engineering is related but does not include all DevOps.
Data security can benefit from data observation, as it can be used to identify unexpected changes in query volume on different tables or changes in the behavior of ETL pipelines. However, data observation by itself will not be a complete data protection solution.
What challenges does this technology face?
Ray: These challenges include issues such as data discovery and governance, cost tracking and management, and access control. It also includes how to handle queries, dashboards, and the growing number of ML features and models.
Reliability and uptime are certainly challenges many DevOps teams are responsible for. But they are also often charged for other aspects such as developer velocity and security reasons. Within these two areas, data overview enables data teams to know whether their data and data pipeline are error free.
What are the challenges of implementing and maintaining data observability technology?
Ray: Effective data observability systems must be integrated into the workflows of the data team. This enables them to continuously respond to data issues and focus on growing their data platform rather than putting out data fires. However, poorly tuned data observability systems can result in a flood of false positives.
An effective data system should perform more maintenance than just testing for data quality issues by automatically adapting to changes in the business. A poorly optimized data observation system, however, may not be accurate for changes in business or more accurate for changes in business that require manual tuning, which can be time-consuming.
Data observability can also be taxing on a data warehouse if not optimized properly. Bigeye teams have experience in optimizing large-scale data observation capability to ensure that the platform does not impact data warehouse performance.