Hybrid architectures for Web3 data infrastructures, an exploration based on the a16z article
The popularity of ChatGPT and GPT-4 has shown us the power of artificial intelligence. Behind artificial intelligence, what matters even more than algorithms is massive data. Around this data we have built large-scale, complex systems whose value comes mainly from business intelligence (BI) and artificial intelligence (AI). Because data volumes have grown rapidly in the Internet age, data infrastructure tooling and best practices are also evolving rapidly. Over the past two years, the core systems of the data infrastructure technology stack have remained very stable, while the supporting tools and applications have grown rapidly.
Web2 Data Infrastructure Architecture
Cloud data warehouses (e.g., Snowflake) are growing rapidly, primarily serving SQL users and business intelligence scenarios. Adoption of other technologies is also accelerating, with data lakes (e.g., Databricks) experiencing unprecedented customer growth, so heterogeneity in the data technology stack looks set to persist.
Other core data systems, such as data acquisition and transformation, have proven equally durable. This is particularly evident in the modern data intelligence space, where combinations of Fivetran and dbt (or similar technologies) can be found almost everywhere. The combination of Databricks/Spark, Confluent/Kafka, and Astronomer/Airflow is also becoming the de facto standard.
- Data Sources generate the relevant business and operational data;
- Data Extraction and Transformation is responsible for extracting data from business systems (E), delivering it to storage while aligning formats between data sources and destinations (L), and sending analyzed data back to business systems as required;
- Data Storage keeps data in a format that can be queried and processed, and needs to be optimized for low cost, high scalability, and analytical workloads;
- Querying and Processing translates high-level programming languages (typically SQL, Python, or Java/Scala) into low-level data processing tasks, and uses distributed computing to execute queries and data models against stored data, covering both historical analysis (describing past events) and predictive analysis (describing expected future events);
- Transformation converts data into analytically usable structures and manages processes and resources;
- Analysis and Output provides an interface through which analysts and data scientists derive traceable insights and collaborate, presents the results of data analysis to internal and external users, and embeds data models into user-facing applications.
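The stages listed above can be illustrated with a very small extract-load-transform pipeline. This is only a sketch under stated assumptions: the in-memory SQLite database stands in for a cloud warehouse, and the sample records, table names, and function names are all hypothetical rather than taken from any vendor mentioned here.

```python
import sqlite3

# Hypothetical records emitted by a business system (the "data source").
SOURCE_EVENTS = [
    {"customer": "alice", "amount": "12.50", "country": "us"},
    {"customer": "bob", "amount": "7.00", "country": "DE"},
    {"customer": "alice", "amount": "3.25", "country": "US"},
]


def extract():
    """Extract: read rows from the source system as-is."""
    return list(SOURCE_EVENTS)


def load(conn, rows):
    """Load: land the raw rows in storage without reshaping them."""
    conn.execute(
        "CREATE TABLE raw_orders (customer TEXT, amount TEXT, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:customer, :amount, :country)", rows
    )


def transform(conn):
    """Transform: build an analytics-ready table inside the store."""
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT customer,
               CAST(amount AS REAL) AS amount,
               UPPER(country) AS country
        FROM raw_orders
        """
    )


def revenue_by_customer(conn):
    """Query: a descriptive (historical) aggregate over the modeled table."""
    cur = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    return cur.fetchall()


def run_pipeline():
    conn = sqlite3.connect(":memory:")  # stand-in for a warehouse
    load(conn, extract())
    transform(conn)
    return revenue_by_customer(conn)
```

Calling `run_pipeline()` returns one revenue row per customer. In a real stack each function would typically be a separate tool rather than a function in one script.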
With the rapid development of the data ecosystem, the concept of "data platforms" has emerged. From an industry perspective, the defining characteristic of a platform is the technical and economic interdependence of an influential platform provider and many third-party developers. From a platform perspective, the data technology stack is divided into a "front-end" and a "back-end."
"Back-end" roughly includes data extraction, storage, processing, and transformation, and has begun to consolidate around several cloud service providers. As a result, customer data is collected in a standard system, and vendors are investing heavily in making this data easily accessible to other developers. This is also a fundamental design principle of systems like Databricks, and it is implemented through the SQL standard and custom compute APIs in systems like Snowflake.
"Front-end" engineers leverage this single integration point to build various new applications. **They rely on data that has been cleaned and integrated into a data warehouse/lakehouse without worrying about how it was generated. A single customer can build and buy many applications on top of one core data system.** We are even starting to see traditional enterprise systems, such as financial or product analytics, being refactored using warehouse-native architectures.
As the data technology stack gradually matures, the data applications on the data platform also surge. Due to standardization, adopting new data platforms has never been more important, and correspondingly maintaining platforms has become extremely important. At scale, platforms can be very valuable. There is intense competition among core data system vendors for current budgets and long-term platform positions. The astonishing valuations of data ingestion and transformation companies are easier to understand if you consider that data ingestion and transformation modules are a core part of emerging data platforms.
However, these technology stacks have been shaped by a data utilization approach dominated by large companies. As society's understanding of data deepens, data is increasingly regarded, alongside land, labor, capital, and technology, as a marketable factor of production. Treating data as one of the five factors of production reflects its value as an asset.
The current technology stack is inadequate for the market-based allocation of data elements. New data infrastructures that are closely integrated with blockchain technology are developing and evolving in Web3. These infrastructures will be embedded in modern data infrastructure architectures to enable the definition of data ownership rights, circulation and trading, revenue distribution, and factor governance. These four areas are critical from a government regulatory perspective and require special attention.
Web3 Hybrid Data Infrastructure Architecture
Inspired by the a16z Unified Data Infrastructure Architecture (2.0) and incorporating our understanding of the Web3 infrastructure architecture, we propose the following Web3 Hybrid Data Infrastructure Architecture.
Orange marks the technology stack units that are unique to Web3. Since decentralization is still in its early stages of development, most applications in the Web3 space still use this hybrid data infrastructure architecture. The vast majority of applications are not true "hyperstructures". A hyperstructure is unstoppable, free, valuable, expansive, permissionless, positive-sum, and credibly neutral. It exists as a public good for the digital world and as public infrastructure for the "metaverse". This requires a completely decentralized underlying architecture to support it.
The traditional data infrastructure architecture evolved in response to the business development of the enterprise. a16z summarizes it in two systems (analytics and business systems) and three scenarios (modern business intelligence, multimodal data processing, and artificial intelligence and machine learning). This is a summary made from the perspective of the business: data for the growth of the business.
However, not only enterprises but also society and individuals should benefit from the productivity gains that data elements bring. Countries worldwide have introduced policies and regulations one after another, hoping to regulate the use of data and promote its circulation. Examples include the data banks that are common in Japan, the data exchanges that have recently emerged in China, and trading platforms that are widely used in Europe and the United States, such as BDEX (USA), Streamr (Switzerland), DAWEX (France), and CARUSO.
As data begins to be assigned ownership, circulated and traded, have its revenue distributed, and be governed, the relevant systems and scenarios go beyond helping companies make decisions and grow their own businesses. These systems and scenarios either need to leverage blockchain technology or rely strongly on policy regulation. Web3 is a natural ground for data factor markets: it technically eliminates the possibility of cheating and can greatly reduce regulatory pressure, allowing data to exist as a true factor of production and be allocated in a market-based manner.
In the Web3 context, the new paradigm of data utilization includes market systems that host circulating data elements and public systems that manage public data elements. They cover three new data business scenarios: development and integration of property-rights data, composable initial data layers, and public data mining.
Some of these scenarios are tightly integrated with traditional data infrastructures and belong to Web3 hybrid data infrastructure architectures. In contrast, others are detached from traditional architectures and are fully supported by new technologies native to Web3.
Web3 and the Data Economy
The data economy marketplace is the key to allocating data elements; it includes the development and integration of property-rights data and the market for composable initial data layers. In an efficient and compliant data economy market, the following points are important:
- Data ownership rights are key to securing rights and compliant use and should be allocated in a structured way, while data use requires an authorization and confirmation mechanism. Each participant should have the relevant rights and interests.
- Circulation transactions must combine on-exchange and off-exchange channels and be both compliant and efficient. They should be based on four principles: the data source can be confirmed, the scope of use can be defined, the circulation process can be traced, and security risks can be prevented.
- The revenue distribution system needs to be efficient and fair. According to the principle of "who inputs, who contributes, who benefits", the government can guide and regulate the distribution of revenue from data elements.
- Factor governance should be secure, controllable, flexible, and inclusive. This requires an innovative government data governance mechanism and a credit system for the data elements market. Enterprises should be encouraged to take an active part in building the data elements market around data sources, data ownership rights, data quality, data use, and so on, and a statement and commitment system for data circulation transactions should be implemented for data vendors and third-party professional service organizations.
The above are the basic principles that regulators apply when considering the data economy. These principles can be used to think about the three scenarios: development and integration of property-rights data, composable initial data layers, and public data mining. What kind of infrastructure do we need to support them? What kind of value can this infrastructure capture, and at what stages?
Scenario 1: Data ownership rights development and integration
In the process of property rights data development, it is necessary to establish a categorical and hierarchical rights confirmation and authorization mechanism to determine the ownership, use rights, and management rights of public, corporate, and personal data. According to the data source and generation characteristics, the property rights of data are defined through "data adaptation". Among them, typical projects include Navigate, Streamr Network, KYVE, etc. These projects realize data quality standardization, data collection, and interface standardization through technical means, confirm the rights of off-chain data in some form, and carry out data classification and hierarchical authorization through smart contracts or internal logic systems.
The applicable data types in this scenario are non-public data, namely enterprise and personal data. The value of data elements should be activated by "common use and shared benefits" in a market-oriented manner.
- Enterprise data covers data that market entities of all kinds collect and process during production and business activities and that does not involve personal information or public interests. Market entities enjoy the right to hold, use, and obtain benefits from this data in accordance with the law, as well as the right to receive reasonable returns for the labor and other factors they contribute.
- Personal data requires data processors to collect, hold, host, and use data within the scope of individual authorization and the law. Innovative technical means should be used to promote the anonymization of personal information and to safeguard information security and personal privacy when personal data is used. Mechanisms should be explored for trustees to represent the interests of individuals and supervise the collection, processing, and use of personal data by market entities. For special personal data related to national security, relevant units may be authorized to use it in accordance with laws and regulations.
Scenario 2: Composable initial data layers
Composable initial data layers are an important part of the data economy market. Unlike general property-rights data, the most obvious feature of this data is that its standard format needs to be defined through "data schema management". Whereas "data adaptation" standardizes quality, collection, and interfaces, the emphasis here is on standardizing data models, including standard data formats and standard data models. Ceramic and Lens are the pioneers in this field. They guarantee standard schemas for off-chain (decentralized storage) and on-chain data respectively, thus making data composable.
Built on top of these data schema management tools are composable initial data layers, often called "data layers", such as Cyberconnect, KNN3, etc.
Composable initial data layers have so far relied little on the Web2 technology stack, but Ceramic-based hot data reading tools change this, which will be a critical breakthrough. Much data of this kind does not need to be stored on a blockchain, where storing it would be difficult, but it still needs to live on a decentralized network. Examples are high-frequency data with low value density, such as user posts, likes, and comments. Ceramic provides a storage paradigm for this type of data.
Composable initial data is a key scenario for innovation in the new era, and it is also an important sign of the end of data hegemony and monopoly. It can solve the data cold-start problem for start-ups by combining mature data sets with new data sets, so that start-ups can build a competitive advantage in data faster. At the same time, it lets start-ups focus on incremental data value and freshness to keep their innovative ideas competitive. This way, large data sets will not become a moat for large companies.
Scenario 3: Public Data Mining
Public data mining is not a new use case but has received unprecedented prominence in the Web3 technology stack.
Traditional public data includes public data generated by party and government agencies, enterprises, and institutions performing their duties according to law or providing public services. Regulatory agencies encourage providing such data to society in the form of models, verification, and other products and services per the requirements of "original data not out of domain, data usable but not visible" on the premise of protecting personal privacy and ensuring public safety. They use traditional technology stacks (blue and some orange, orange represents the intersection of multiple types of technology stacks, the same as below).
In Web3, the transaction and activity data on the blockchain are another type of public data. This data is "available and visible", so it lacks data privacy, data security, and usage authorization capabilities, and it is a genuine public good. It relies on a technology stack (yellow and partially orange) with blockchain and smart contracts at its core.
The data on decentralized storage is mostly Web3 application data other than transactions. At present, it is mainly file and object storage, and the corresponding technology stack is still immature (green and some orange). The production and mining of this type of public data involve common storage concerns, including hot and cold storage, indexing, state synchronization, rights management, and computation.
Many data applications have emerged in this scenario. They are not data infrastructure but more data tools, including Nansen, Dune, NFT Scan, 0x Scope, etc.
Case: Data Exchange
A data exchange is a platform where data is traded as a commodity. Exchanges can be categorized and compared based on transaction objects, pricing mechanisms, quality assurance, and so on. Data Stream X, Dawex, and Ocean Protocol are typical data exchanges.
Ocean Protocol (200 million market cap) is an open-source protocol that enables businesses and individuals to exchange and monetize data and data-based services. The protocol is based on the Ethereum blockchain and uses "datatokens" to control access to datasets. A datatoken is a special ERC-20 token representing ownership of or access to a dataset or a data service. Users can purchase or earn datatokens to access the information they need.
The technical architecture of Ocean Protocol consists of the following main components:
- Providers: Parties that supply data or data services and can earn revenue by issuing and selling their own datatokens through Ocean Protocol.
- Consumers: Parties that buy and use data or data services and can purchase or earn the datatokens they need to gain access through Ocean Protocol.
- Marketplaces: An open, transparent, and fair marketplace for data transactions, provided by Ocean Protocol or a third party, that connects providers and consumers globally and offers datatokens of multiple types and domains. Marketplaces can help organizations discover new business opportunities, increase revenue streams, optimize operational efficiency, and create value.
- Network: refers to a decentralized network layer provided by Ocean Protocol that supports data exchange of different types and sizes and ensures security, trust, and transparency in the data transaction process. The network layer is a set of smart contracts that are used to register data, record ownership information, facilitate secure data exchange, etc.
- Curator: A role in the ecosystem responsible for screening, managing, and reviewing datasets. Curators review information about a dataset's source, content, format, and license to ensure that the dataset meets standards and can be trusted and used by other users.
- Verifier: A role in the ecosystem that verifies and vets data transactions and data services, and reviews and validates transactions between data service providers and consumers to ensure the quality, availability, and accuracy of data services.
The "data services" that data providers create include data, algorithms, computation, storage, analysis, and curation. These components are tied to the service's execution agreements (service-level agreements), secure computing, access control, and permissions. Essentially, this controls access to a "cloud service suite" through smart contracts.
The advantages are:
- Open-source, flexible, and extensible protocols help organizations and individuals create unique data ecosystems.
- The decentralized network layer based on blockchain technology can ensure the security, credibility, and transparency of the data transaction process while protecting providers' and consumers' privacy and rights.
- An open, transparent, and fair data market can connect providers and consumers worldwide and offer datatokens of various types and fields.
Ocean Protocol is a typical representative of the hybrid architecture. Its data can be stored in different places, including traditional cloud storage services, decentralized storage networks, or the data provider's own servers. The protocol uses datatokens and non-fungible tokens (data NFTs) to identify and manage data ownership and access rights. In addition, the protocol provides compute-to-data, which enables data consumers to analyze and process data without the original data being exposed.
Although Ocean Protocol is one of the most complete data trading platforms on the market at this stage, it still faces many challenges:
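To make the two ideas in this paragraph concrete, here is a toy Python model of token-gated access combined with compute-to-data. It is a sketch only and does not use Ocean Protocol's contracts or SDK: the `Datatoken` and `DataService` classes, the one-token price, and the `average` job are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Datatoken:
    """Toy ERC-20 style ledger (an assumption, not Ocean's real contract)."""
    balances: dict = field(default_factory=dict)

    def mint(self, to, amount):
        self.balances[to] = self.balances.get(to, 0) + amount

    def transfer(self, sender, to, amount):
        if self.balances.get(sender, 0) < amount:
            raise ValueError("insufficient datatoken balance")
        self.balances[sender] -= amount
        self.balances[to] = self.balances.get(to, 0) + amount


@dataclass
class DataService:
    """Hypothetical provider-side service guarding a private dataset."""
    owner: str
    token: Datatoken
    rows: list  # raw data stays with the provider

    def compute(self, consumer, job):
        # Access costs one datatoken, paid to the provider.
        self.token.transfer(consumer, self.owner, 1)
        # Compute-to-data: run the job next to the data and
        # hand back only its result, never the rows themselves.
        return job(self.rows)


def average(rows):
    return sum(rows) / len(rows)


token = Datatoken()
service = DataService(owner="provider", token=token, rows=[3, 5, 10])
token.mint("provider", 2)
token.transfer("provider", "consumer", 1)  # consumer acquires one datatoken
result = service.compute("consumer", average)
```

Here the consumer learns only the aggregate in `result`. A second call to `service.compute` would raise `ValueError`, because the consumer's single datatoken has already been spent.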
- Establish an effective trust mechanism to increase the trust between data providers and demanders and reduce transaction risks. For example, establish a data element market credit system, and carry out certification and verification through the blockchain to identify untrustworthy behaviors in data transactions, incentives for keeping promises, punishment for untrustworthiness, credit restoration, objection handling, etc.
- Establish a reasonable pricing mechanism to reflect the true value of data products, incentivize data providers to provide high-quality data, and attract more demanders.
- Establish a unified standard specification to facilitate interoperability and compatibility between data of different formats, types, sources, and purposes.
Case: Data Model Marketplace
In its Data Universe, Ceramic describes the open data model marketplace it wants to create, because data needs interoperability, which can contribute significantly to productivity gains. Such a data model marketplace is achieved through emergent consensus on data models, similar to the ERC contract standards on Ethereum: developers choose a functional template and get an application that conforms to all the data in that data model. At this stage, such a marketplace is not a trading marketplace.
As a simple example, in a decentralized social network the data model can be reduced to four models:
- PostList: stores the index of user posts
- Post: stores individual posts
- Profile: stores the user's profile
- FollowList: stores the user's follow list
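The four models above could be sketched as plain Python dataclasses. This is not Ceramic's schema format. The field names and the `timeline` helper are assumptions added to show how an application might compose data that follows a shared model.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    post_id: str
    author: str
    text: str


@dataclass
class PostList:
    author: str
    post_ids: list = field(default_factory=list)  # index of the user's posts


@dataclass
class Profile:
    user: str
    name: str
    bio: str = ""


@dataclass
class FollowList:
    user: str
    following: list = field(default_factory=list)


def timeline(user, follow_lists, post_lists, posts):
    """Collect the posts written by everyone `user` follows."""
    result = []
    for followed in follow_lists[user].following:
        for post_id in post_lists[followed].post_ids:
            result.append(posts[post_id])
    return result
```

Because every application reads and writes the same four shapes, a second application could call `timeline` on data produced by the first without any conversion step, which is the interoperability the data model is meant to provide.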
So how can data models be created, shared, and reused on Ceramic to enable cross-application data interoperability?
Ceramic provides a Data Models Registry, an open-source, community-built repository of reusable application data models for Ceramic. Here, developers can openly register, discover, and reuse existing data models, which are the foundation for interoperable applications built on shared data models. Currently, the registry is stored on GitHub; in the future, it will be decentralized on Ceramic.
All data models added to the registry are automatically published as npm packages under the @datamodels scope. Any developer can install one or more data models using @datamodels/model-name. These models are available for storing or retrieving data at runtime using any IDX client, including DID Data Store or Self.ID.
In addition, Ceramic has built a Data Models forum based on GitHub. Each model in the data model registry has its own discussion thread on this forum, where the community can comment on and discuss it. It is also a place for developers to publish their ideas on data models and gather community feedback before adding them to the registry. Everything is still at an early stage, and there are not many data models in the registry. Data models included in the registry should be evaluated by the community and adopted as CIP standards, much like Ethereum's smart contract standards, which provide composability.
Case: Decentralized Data Warehouse
Space and Time (SxT) describes itself as the first decentralized data warehouse to connect on-chain and off-chain data to support a new generation of smart contract use cases, and as having the industry's most mature blockchain indexing service. The SxT data warehouse also employs new cryptography called Proof of SQL™ to generate verifiable, tamper-proof results, allowing developers to join untrusted on-chain and off-chain data in simple SQL and load the results directly into smart contracts, supporting sub-second queries and enterprise-level analytics in a fully tamper-proof and blockchain-anchored manner.
Space and Time is a two-tier network with a validator layer and a data warehouse. The success of the SxT platform depends on the seamless interaction of the validator and the data warehouse to facilitate simple and secure querying of both on-chain and off-chain data.
- The data warehouse consists of database networks and compute clusters that are controlled by Space and Time validators, which also route requests to them. Space and Time adopts a flexible storage solution: HTAP (hybrid transactional/analytical processing).
- The validator monitors, commands, and validates the services provided by these clusters, then orchestrates the data flow and queries between end users and data warehouse clusters. Validators provide a means for data to enter a system (such as a blockchain index) and data to exit a system (such as a smart contract). Its services are:
  - Routing - supports transactional and query interactions with a network of decentralized data warehouses
  - Streaming - acts as a sink for high-volume customer streaming (event-driven) workloads
  - Consensus - provides high-performance Byzantine Fault Tolerance for data entering and exiting the platform
  - Query Proof - provides SQL proof to the platform
  - Table Anchor - provides proof of storage to the platform by anchoring the table on the chain
  - Oracle - supports Web3 interactions, including smart contract event listening and cross-chain messaging/relaying
  - Security - prevents unauthenticated and unauthorized access to the platform
Space and Time as a platform is positioned as the world's first decentralized data structure, opening up a powerful but underserved market: data sharing. Within the Space and Time platform, companies can share data freely and use smart contracts to trade the shared data. Additionally, datasets can be monetized and aggregated using Proof of SQL without giving consumers access to the raw data. Data consumers can trust that aggregations are accurate without seeing the data itself, so data providers no longer have to be data consumers. For this reason, the combination of SQL proofs and data structure schemas has the potential to democratize data manipulation, as anyone can contribute to ingesting, transforming, and serving datasets.
Web3 Data Governance and Discovery
Currently, the Web3 data infrastructure lacks a practical and efficient data governance architecture. Yet such an architecture is essential for allocating each participant's rights and interests in data elements.
- For the data source, it is necessary to have informed consent and the right to acquire, copy and transfer the data itself freely.
- For data processors, they need to have the right to self-control, use data and obtain benefits.
- For data derivatives, operating rights are required.
Currently, Web3 data governance capabilities are limited: assets and data (including those on Ceramic) can only be controlled through private keys, and almost no hierarchical or classified permission configuration exists. Recently, the innovative mechanisms of Tableland, FEVM, and Greenfield have made trustless data governance achievable to a certain extent. Traditional data governance tools such as Collibra can only be used within an enterprise and offer only platform-level trust. At the same time, non-decentralized technology cannot prevent malicious behavior by individuals or single points of failure. Data governance tools such as Tableland can help guarantee the security technologies, standards, and solutions required for the data circulation process.
Case: Tableland
Tableland Network is a decentralized web3 protocol for structured relational data, starting with Ethereum (EVM) and EVM-compatible L2. With Tableland, it is now possible to implement traditional web2 relational database functionality by leveraging the blockchain layer for access control. However, Tableland is not a new database - just web3 native relational tables.
Tableland offers a new way to enable dapps to store relational data in web3-native networks without these tradeoffs.
Solutions
With Tableland, metadata can be mutated (with access controls if needed), queried (using familiar SQL), and composed (with other tables on Tableland), all in a fully decentralized manner.
Tableland breaks down a traditional relational database into two main components: an on-chain registry with access control logic (ACL) and an off-chain (decentralized) table. Every table in Tableland is initially minted as an ERC721 token on a base EVM-compatible layer. Thus, on-chain table owners can set ACL permissions on the table, while off-chain, the Tableland network manages the creation and subsequent alteration of the table itself. Linking between on-chain and off-chain is handled at the contract level, which points to the Tableland network (using base URI + token URI, much like many existing ERC721 tokens that use IPFS gateways or hosted servers for metadata).
Only people with the appropriate on-chain privileges can write to a particular table. However, table reads do not have to be on-chain operations and can use the Tableland gateway; therefore, read queries are free and can come from simple front-end requests or other non-EVM blockchains. To use Tableland, a table must first be created (i.e., minted on-chain as an ERC721 token). The deploying address is initially set as the table owner, and this owner can set permissions for any other users who try to interact with the table to make changes. For example, the owner can set rules for who can update/insert/delete values, what data they can change, and even decide whether to transfer ownership of the table to another party. In addition, more complex queries can join data from multiple tables (owned or unowned) to create a fully dynamic and composable relational data layer.
Consider the following diagram, which summarizes a new user's interaction with a table that has been deployed to Tableland by some dapp:
Here is the overall flow of information:
- A new user interacts with the dapp's UI and tries to update some information stored in a Tableland table.
- The dapp calls the Tableland registry smart contract to run this SQL statement, and this contract checks the dapp's smart contract, which contains a custom ACL that defines the permissions of this new user. There are a few points to note:
  - Custom ACLs in separate smart contracts for dapps are a completely optional but advanced use case; developers don't need to implement custom ACLs and can use the default policy of the Tableland registry smart contract (only the owner has full permissions).
  - You can also use the Gateway to write queries instead of calling Tableland smart contracts directly. There will always be an option for dapps to call Tableland smart contracts directly, but any query can be sent through the Gateway, which will relay the query to the smart contract on a subsidized basis.
- The Tableland smart contract takes the user's SQL statements and permissions, incorporating these into emitted events that describe the SQL-based actions to be taken.
- The Tableland Validator node listens to these events and then takes one of the following actions:
  - If the user has the correct permissions to write to the table, the Validator will run SQL commands accordingly (insert a new row into the table or update an existing value) and broadcast the confirmation data to the Tableland network.
  - If the user does not have the correct permissions, the Validator will not perform any operations on the table.
  - If the request is a simple read query, the corresponding data is returned; Tableland is a completely open relational data network where anyone can perform read-only queries on any table.
- Dapps can reflect any updates on the Tableland network via the Gateway.
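The flow above can be imitated in a few lines of Python. This is a toy model and does not use Tableland's contracts, SDK, or gateway: the `Registry` and `Validator` classes, the addresses, and the event fields are hypothetical, and an in-memory SQLite database stands in for the off-chain table.

```python
import sqlite3


class Registry:
    """Toy stand-in for the on-chain registry: holds ACLs, emits events."""

    def __init__(self):
        self.owners = {}   # table name -> owner address
        self.writers = {}  # table name -> extra addresses allowed to write
        self.events = []   # append-only log that validators listen to

    def create_table(self, owner, name, create_sql):
        # The creating address becomes the table owner.
        self.owners[name] = owner
        self.writers[name] = set()
        self.events.append({"sql": create_sql, "params": (), "allowed": True})

    def grant(self, caller, name, address):
        if caller != self.owners[name]:
            raise PermissionError("only the table owner can change the ACL")
        self.writers[name].add(address)

    def write(self, caller, name, sql, params=()):
        # Default policy in this sketch: owner plus explicitly granted writers.
        allowed = caller == self.owners[name] or caller in self.writers[name]
        self.events.append({"sql": sql, "params": params, "allowed": allowed})


class Validator:
    """Toy validator: replays permitted events against an off-chain table."""

    def __init__(self, registry):
        self.registry = registry
        self.db = sqlite3.connect(":memory:")
        self.cursor = 0  # position in the event log already processed

    def sync(self):
        for event in self.registry.events[self.cursor:]:
            if event["allowed"]:  # writes without permission are ignored
                self.db.execute(event["sql"], event["params"])
        self.cursor = len(self.registry.events)

    def read(self, sql):
        # Reads are open to anyone and never touch the registry.
        return self.db.execute(sql).fetchall()


registry = Registry()
validator = Validator(registry)
registry.create_table(
    "0xOwner", "scores", "CREATE TABLE scores (player TEXT, points INTEGER)"
)
registry.write("0xOwner", "scores", "INSERT INTO scores VALUES (?, ?)", ("alice", 10))
registry.write("0xStranger", "scores", "INSERT INTO scores VALUES (?, ?)", ("mallory", 999))
validator.sync()
rows = validator.read("SELECT player, points FROM scores")
```

After `sync()`, `rows` contains only the owner's insert, since the stranger's write was emitted as an event but skipped by the validator. The split mirrors the design described above: permission decisions live with the registry, while table state and free reads live off-chain.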
Use cases to avoid
- Personally identifiable data - Tableland is an open network; anyone can read data from any table. Therefore, personal data should not be stored in Tableland.
- High-frequency, sub-second writes - e.g., high-frequency trading bots.
- Storing every user interaction in the application - keeping data such as keystrokes or clicks in Web3 tables may not make sense. The frequency of writes can lead to high costs.
- Very large data sets - These should be avoided and are best handled through file storage, using solutions such as IPFS, Filecoin, or Arweave. However, pointers to these locations and associated metadata are actually a good use case for Tableland tables.
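The last point, keeping only pointers and metadata in a table while the payload lives in file storage, might look like the following sketch. It assumes a hypothetical `file_pointers` schema and a made-up `offchain://` location scheme; the SHA-256 digest is a stand-in for a content address and is not a real IPFS CID, and SQLite stands in for a Tableland table.

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical schema: the table holds only a pointer plus small metadata.
db.execute(
    "CREATE TABLE file_pointers ("
    "name TEXT PRIMARY KEY, location TEXT, size_bytes INTEGER, mime TEXT)"
)


def register_file(name: str, payload: bytes, mime: str) -> str:
    """Record where a large file lives without storing the file itself."""
    digest = hashlib.sha256(payload).hexdigest()
    # Placeholder address; a real deployment would use the identifier
    # returned by the storage network after uploading the payload.
    location = f"offchain://{digest}"
    db.execute(
        "INSERT INTO file_pointers VALUES (?, ?, ?, ?)",
        (name, location, len(payload), mime),
    )
    db.commit()
    return location


payload = b"\x00" * 1_000_000  # pretend this is a large data set
location = register_file("dataset.bin", payload, "application/octet-stream")
row = db.execute(
    "SELECT location, size_bytes, mime FROM file_pointers WHERE name = ?",
    ("dataset.bin",),
).fetchone()
```

The row stays a few hundred bytes regardless of how large the payload is, which is the property that makes pointers a reasonable fit for on-chain tables.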
Thoughts on value capture
Each unit plays an irreplaceable role in the overall data infrastructure architecture. The value each one captures is mainly reflected in its market value or valuation and its estimated earnings, from which the following conclusions can be drawn:
- The Data Source is the module with the largest value capture in the whole architecture
- Data Replication, Transformation, Streaming, and Data Warehousing capture the next most value
- The Analytics layer may have good cash flow, but there will be an upper limit to its valuation
Companies/projects on the left of the structure chart tend to capture greater value.
Industry Concentration
Based on an incomplete statistical analysis of the companies in the chart above, industry concentration can be judged as follows:
- Highest concentration: the data storage module and the data query and processing module
- Medium concentration: the data extraction and conversion module
- Lowest concentration: the data source module and the analysis and output module
The industry concentration of data sources and of analysis and output is low. The preliminary judgment is that differing business scenarios produce a separate vertical leader in each one, such as Oracle in databases, Stripe in third-party services, Salesforce in enterprise services, Tableau in dashboard analysis, Sisense in embedded analysis, etc.
The moderate concentration of the data extraction and conversion module is tentatively attributed to the technology-oriented nature of the business. The modular form of middleware also keeps switching costs relatively low.
The data storage module and the query and processing module have the highest concentration. This is tentatively attributed to a single business scenario, high technical content, high start-up cost, and high cost of subsequent switching, which together give a company/project a strong first-mover advantage and network effect.
Thoughts on business models and exit paths for data protocols
Judging by when companies/projects were established and listed:
- Most companies/projects established before 2010 are data source companies/projects. The mobile Internet had not yet emerged, and data volumes were still small. There were also some data storage projects and analysis output projects, the latter mainly dashboards.
- From 2010 to 2014, on the eve of the rise of the mobile Internet, data storage and query projects such as Snowflake and Databricks were born, data extraction and conversion projects began to appear, and a mature set of big data management solutions gradually took shape. There were also a large number of analysis output projects, mainly dashboards.
- From 2015 to 2020, query and processing projects sprang up, and many data extraction and conversion projects emerged as well, so that people could better harness the power of big data.
- From 2020 onwards, newer real-time analytics databases and data lake solutions emerged, such as ClickHouse and Tabular.
Infrastructure improvement is the premise of so-called "mass adoption". New opportunities still appear during the period of large-scale application, but they are almost exclusively "middleware". The underlying data warehouses, data sources, and similar solutions are close to a winner-takes-all situation, and unless there is a substantial technical breakthrough, it will be difficult for newcomers to grow.
Analysis output projects offer opportunities for startups in every period, but they must keep iterating and building new things for new scenarios. Tableau, which appeared before 2010, took most of the market for desktop dashboard analysis tools. New scenarios emerged later, such as more professionally oriented DS/ML tools, more comprehensive data workstations, and more SaaS-oriented embedded analysis.
Looking at the current data protocols of Web3 from this perspective:
- The data source and storage projects are not yet settled, but leaders are emerging. On-chain state storage is led by Ethereum (220 billion market value), while decentralized storage is led by Filecoin (2.3 billion market value) and Arweave (280 million market value). Greenfield may also emerge suddenly as a contender. - Highest value capture
- There is still room for innovation in data extraction and conversion projects. The data oracle Chainlink (market value of 3.8 billion) is just the beginning. Event streaming and stream processing infrastructure such as Ceramic, along with more projects, will appear, but the room is limited. - Moderate value capture
- For query and processing projects, The Graph (with a market value of 1.2 billion) already meets most needs, and the variety and number of projects have not yet reached an explosive period. - Moderate value capture
- Data analysis projects, mainly Nansen and Dune (valued at 1 billion), need new scenarios to create new opportunities. NFT Scan and NFT Go resemble new scenarios, but they only update the content and do not introduce new requirements at the level of analysis logic or paradigm. - Moderate value capture, strong cash flow.
But Web3 is neither a copy of Web2 nor a complete evolution of it. Web3 has its own original missions and scenarios, which give rise to business scenarios completely different from earlier ones (the first three scenarios are all abstractions that can be made now).