Knowledge-Driven and Physics-Integrated: A Technical White Paper on the Next-Generation High-Quality Dataset Platform for Oil and Gas Exploration and Development

21/06/2026

Introduction: Challenges and Vision

Under the strategic guidance of China’s Three-Year Action Plan for “Data Elements ×” and the “AI+” special action for central state-owned enterprises, the oil and gas industry is undergoing a profound digital and intelligent transformation. In particular, the “Data Pulse Initiative” proposed in CNOOC’s “AI+” action plan has elevated the development of high-quality, model-ready datasets to an unprecedented strategic priority. Benchmarking against international advanced practices, such as the Penobscot public dataset and ExxonMobil’s OSDU-based DDMS, we clearly recognize that bridging the gap between “raw data” and “model-ready data” has become the critical foundation for activating data elements and driving high-quality development in the industry.

However, the upstream oil and gas business still faces significant challenges in the deep application of data, which severely constrain the depth and breadth of artificial intelligence adoption. This white paper aims to propose a new data-driven operating model for exploration and development, enabled by a next-generation platform that integrates knowledge-driven approaches with physical mechanisms. Its core objective is to systematically address the bottlenecks in data value realization and empower frontline business experts.

To achieve this vision, we must first confront the four core challenges currently faced by the oil and gas industry in building high-quality datasets:

1. Disconnection Between Data Foundations and Business Needs. Traditional IT-driven data cleansing processes are often disconnected from business practice due to a lack of deep understanding of the physical mechanisms in the oil and gas domain. For example, when processing well logging or production data, certain statistically defined “outliers” may actually contain critical geological information, such as high-pressure zones or fracture belts. Cleansing rules without physical mechanism constraints may easily misidentify and remove these key signals, resulting in data distortion. In addition, inconsistent data model standards and insufficient large-scale business validation make it difficult for the data foundation to effectively support the training and application of high-quality models.

2. Bottlenecks in High-Quality Data Annotation. Current data annotation work relies heavily on manual operations by domain experts, creating four major bottlenecks. First, efficiency is low and cannot meet the demand for massive training samples required by large-scale model training. Second, costs are high, as expert resources are scarce and valuable. Third, quality is unstable, since different experts may apply different annotation standards and interpretations. Fourth, the lack of a unified semantic model leads to inconsistent annotation criteria, making data difficult to reuse and accumulate. This has become a critical constraint on the practical implementation of AI applications.

3. Incomplete Construction Processes and Management Systems. The industry has not yet widely established a full-lifecycle process and management system covering data acquisition, governance, annotation, validation, updating, and application. As a result, many constructed datasets lack effective version management, lineage tracing, and collaboration mechanisms. They often fall into the dilemma of being “built once but difficult to reuse,” making it impossible to form a stable, efficient, and sustainable intelligent data supply capability.

4. Difficulty in Releasing Data Value. Frontline geologists and reservoir engineers in the oil and gas industry possess the richest business knowledge, but they generally lack professional capabilities in algorithm programming and model development. Faced with massive data lake assets, they have an urgent need to solve real production problems, yet often lack ready-to-use low-code or no-code modeling tools. As a result, the value of data assets cannot be fully and efficiently unlocked.

To systematically address the above challenges, Jurassic Software has designed a new platform architecture. The goal is to make complex data preparation and modeling workflows more process-oriented, intelligent, and accessible, thereby supporting this new operating model.

Overall Architecture Design of the High-Quality Dataset Platform

To effectively address the challenges outlined above, the core design philosophy of the platform architecture is “data migration plus marketplace-oriented management.” This architectural principle avoids the high risks and costs associated with replacing existing enterprise data lakes. Instead, it builds an agile, high-value intermediate layer. The platform efficiently “migrates” raw data from the data lake, processes it through a series of intelligent refinement and feature engineering workflows, and ultimately forms a trusted, traceable, and reusable “marketplace” of high-quality feature assets. It also provides business experts with an integrated low-code computing and modeling environment, significantly accelerating the time-to-market of AI applications while ensuring data governance and lineage traceability.

The logical architecture of the platform consists of four layers, each with clear and critical responsibilities:

1. Data Ingestion Layer. As the strategic entry point for data federation, this layer seamlessly integrates existing enterprise data assets. It can connect to enterprise-level data lakes and knowledge graph databases, and supports scheduled or event-triggered data migration tasks for multi-source heterogeneous data, including LAS files for well logging, SEGY files for seismic data, PDF reports, and production reports. This design ensures that the platform can continuously acquire the latest raw data while remaining decoupled from underlying data storage systems.

2. Data Processing and Feature Engineering Layer — Feature Factory. As the core processing engine of the platform, this layer is designed as a “feature factory.” It integrates four key capabilities to transform raw data into high-quality features:

(1) Unstructured Data Parsing Engine: Built-in OCR, NLP, and computer vision operators enable the automated extraction of structured information from unstructured data such as scanned PDF reports, daily reports, and core images.

(2) Physics-Based Data Cleansing Engine: Based on rule libraries constructed from domain knowledge such as fluid mechanics and geostatistics, this engine cleanses and validates data in accordance with business and physical laws.

(3) Multimodal Data Fusion Engine: This engine addresses the core challenge of granularity alignment. For example, it intelligently aligns production data with a time granularity of “days” and well logging data with a depth granularity of “meters,” thereby building a unified analytical view.

(4) Knowledge-Driven Annotation Engine: By leveraging entities and relationships accumulated in the enterprise knowledge graph, this engine enables the automated generation of sample labels, greatly reducing the cost of manual annotation.

3. Feature Asset Management Layer — Feature Store. This layer serves as the platform’s “feature marketplace” and forms the foundation for turning data assets into core competitiveness. It centrally stores the high-quality feature sets generated by the Feature Factory and provides comprehensive management capabilities. The platform supports feature version control to ensure the reproducibility of model training. It also provides rich metadata management and full-text search capabilities, allowing users to quickly discover, understand, and reuse existing feature assets. In this way, the platform breaks down data silos and eliminates repetitive data preparation across teams.

4. Algorithm Modeling and Service Layer — AI Lab & Serving. This layer serves as the carrier for making AI capabilities broadly accessible. Through a visual low-code modeling canvas, it encapsulates complex algorithm modeling processes into simple drag-and-drop operations, extending modeling capabilities from a small number of data scientists to a broad group of frontline domain experts. Users can conduct model training, evaluate model performance, and publish validated models as standard REST API services with one click, enabling downstream production systems to directly invoke them and forming a closed loop from data to value.

This four-layer architecture is designed to support an end-to-end core workflow for business experts, from scenario definition to model publishing, making the process smooth, efficient, and easy to master.
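As a rough illustration of the granularity alignment handled by the Multimodal Data Fusion Engine in the Feature Factory layer, the sketch below collapses depth-sampled log values over one depth interval into a single value per well and attaches it to every daily production record. The field names (`well`, `depth_m`, `porosity`, `oil_rate`), the interval bounds, and the use of a simple mean are assumptions made for illustration only; the white paper does not specify the platform's actual alignment method.

```python
from statistics import mean


def summarize_log_interval(log_rows, top_m, base_m, curve):
    """Average one log curve over the depth interval [top_m, base_m]."""
    values = [r[curve] for r in log_rows if top_m <= r["depth_m"] <= base_m]
    return mean(values) if values else None


def fuse(production_rows, logs_by_well, intervals, curve="porosity"):
    """Attach a per-well static log feature to every daily production row."""
    fused = []
    for row in production_rows:
        well = row["well"]
        top_m, base_m = intervals[well]
        static_value = summarize_log_interval(
            logs_by_well.get(well, []), top_m, base_m, curve
        )
        fused.append({**row, f"mean_{curve}": static_value})
    return fused


# Hypothetical inputs: depth-indexed log samples and day-indexed production.
logs_by_well = {
    "W-1": [
        {"depth_m": 2000.0, "porosity": 0.18},
        {"depth_m": 2001.0, "porosity": 0.22},
        {"depth_m": 2050.0, "porosity": 0.05},  # outside the interval below
    ]
}
production_rows = [
    {"well": "W-1", "date": "2026-01-01", "oil_rate": 120.0},
    {"well": "W-1", "date": "2026-01-02", "oil_rate": 118.0},
]
intervals = {"W-1": (1999.0, 2002.0)}  # assumed producing interval, in meters

unified_view = fuse(production_rows, logs_by_well, intervals)
```

Each row of `unified_view` keeps its daily granularity and gains one depth-derived column, which is one simple way to obtain a unified analytical view; other aggregations (thickness-weighted means, percentiles) would follow the same pattern.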

Core User Workflow: A Five-Step Approach to Self-Service Modeling

The core of this platform design is a clear and streamlined five-step user workflow tailored for geologists and reservoir engineers. This workflow represents the essence of the new operating model. Its objective is to enable the experts who best understand the business to independently and efficiently complete the entire end-to-end process from business problem definition to AI model publishing, without the need to write complex code. In this way, AI capabilities can be made truly accessible to a broader user base.

Step 1: Task Definition

User Operation: After logging into the platform, users first select a business domain of interest from predefined areas, such as exploration, development, or engineering, and then define a specific analytical task.

Platform Functionality: The platform provides a series of preset business scenario templates, such as “Lithology Identification” for classification tasks, “Single-Well Production Forecasting” for regression tasks, and “Drilling Stuck-Pipe Early Warning” for time-series anomaly detection. The system guides users to clearly define the prediction target, or target variable, such as forecasting future water cut or identifying sandstone intervals at specific depths.

Step 2: Feature Selection and Construction

User Operation: Users enter the modeling canvas and select existing high-quality features from the feature marketplace on the right-hand side through drag-and-drop operations. If the existing features are insufficient to solve the problem, users can create new features using the platform’s built-in tools.

Platform Functionality: The platform provides powerful feature construction tools. For example, users can upload a drilling daily report in PDF format, and the platform will automatically extract “drilling fluid density” through its unstructured data extraction capability. Users can also select preset physics-based cleansing rules, such as enabling “material balance validation,” allowing the system to automatically identify and cleanse data that violates physical laws. In addition, the platform can activate knowledge graph-based automatic annotation. For instance, if the system identifies from the knowledge graph that a certain well has a well test conclusion indicating an “oil-bearing interval” at a specific depth, it will automatically label the corresponding data samples at that depth as “oil-bearing interval.”

Step 3: Correlation Analysis

User Operation: After building a dataset containing dozens of candidate features, users need to identify the key factors that have the greatest impact on the prediction target, namely the dominant controlling factors.

Platform Functionality: The platform provides a range of visual decision-support tools to simplify this process. A correlation heatmap helps users identify and remove redundant, highly collinear features. Maximal Information Coefficient, or MIC, analysis can uncover nonlinear relationships between features and the target variable. The platform can also perform a quick pre-run based on tree models, such as random forests, and present the importance ranking of each feature in an intuitive bar chart, helping experts make more informed decisions.
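The redundancy screening behind a correlation heatmap can be sketched in a few lines. The example below computes Pearson correlations and greedily keeps a feature only if it is not strongly correlated with a feature already kept. The feature names, sample values, and the 0.95 threshold are hypothetical, and the sketch covers only linear correlation; MIC and tree-based importance ranking would require dedicated libraries and are not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant feature has no defined correlation
    return cov / (std_x * std_y)


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical candidate features sampled at the same five depths.
features = {
    "gamma_ray": [45.0, 60.0, 80.0, 95.0, 120.0],
    "gamma_ray_scaled": [0.45, 0.60, 0.80, 0.95, 1.20],  # linear copy of gamma_ray
    "density": [2.65, 2.40, 2.55, 2.30, 2.60],
}
selected = drop_redundant(features)
```

The greedy order matters: whichever feature of a correlated pair appears first is the one retained, so in practice an expert would order candidates by physical relevance before screening.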

Step 4: AutoML and Modeling

User Operation: Users select one or more suitable algorithmic models from the platform’s “algorithm supermarket,” configure key parameters, or directly use the default values recommended by the system, and then click the “Start Training” button.

Platform Functionality: The platform’s “algorithm supermarket” integrates mainstream open-source algorithm libraries that have been validated in domain-specific scenarios, covering a wide range of applications:

1. Sequence Models, such as LSTM and Transformer: Suitable for time-dependent tasks, such as single-well production forecasting and well log curve reconstruction.

2. Classification and Regression Models, such as XGBoost and Random Forest: Suitable for static classification and regression tasks, such as lithology identification and fracturing performance evaluation.

3. Image Models, such as CNN and U-Net: Suitable for spatial feature tasks, such as seismic facies identification and core image analysis.

At the same time, the platform has built-in AutoML capabilities for automatic hyperparameter tuning. Through techniques such as grid search or Bayesian optimization, the system searches for a well-performing combination of hyperparameters within a user-defined range, further lowering the barrier to modeling.
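To make the grid search idea concrete, the following minimal sketch exhaustively evaluates every combination in a parameter grid and keeps the one with the lowest error. The toy model (simple exponential smoothing), the production series, and the grid values are assumptions chosen for illustration; they are not the platform's algorithms or defaults, and Bayesian optimization is not shown.

```python
from itertools import product
from math import sqrt


def one_step_rmse(series, alpha):
    """One-step-ahead RMSE of simple exponential smoothing on a series."""
    level = series[0]
    squared_errors = []
    for actual in series[1:]:
        squared_errors.append((actual - level) ** 2)  # forecast is the current level
        level = alpha * actual + (1 - alpha) * level
    return sqrt(sum(squared_errors) / len(squared_errors))


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the lowest-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical declining daily oil rate for a single well.
oil_rate = [100.0, 96.0, 93.0, 89.0, 86.0, 83.0, 80.0, 77.0]
param_grid = {"alpha": [0.1, 0.3, 0.5, 0.7, 0.9]}

best_params, best_score = grid_search(
    param_grid, lambda alpha: one_step_rmse(oil_rate, alpha)
)
```

For brevity the score here is computed on the same series used for smoothing; a real tuning run would score each combination on held-out validation data to avoid selecting hyperparameters that merely overfit.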

Step 5: Evaluation and Serving

User Operation: After model training is completed, users review the model evaluation report generated by the system. Once they confirm that the model performance meets business expectations, they click the “One-Click Publish” button.

Platform Functionality: The platform provides not only standard machine learning evaluation metrics, such as RMSE and AUC, but also business-oriented evaluation metrics, such as historical fitting rate and conformity with water-cut rising trends, making the evaluation results more aligned with real production scenarios. Once the model passes validation, the platform can package it as an independent, containerized REST API service with one click and automatically register it with the enterprise service gateway, enabling convenient invocation by external production systems.
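The difference between a standard metric and a business-oriented one can be sketched as follows. RMSE is the usual definition; the two business metrics are hypothetical formulations, since the white paper does not define them: `history_fit_rate` is assumed to be the share of points whose relative error is within a tolerance, and `trend_agreement` is assumed to be the share of consecutive steps where the predicted change has the same sign as the actual change.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def history_fit_rate(actual, predicted, rel_tol=0.10):
    """ASSUMED definition: share of points with relative error within rel_tol."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(p - a) <= rel_tol * abs(a))
    return hits / len(actual)


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def trend_agreement(actual, predicted):
    """ASSUMED definition: share of steps where predicted and actual move the same way."""
    actual_steps = [b - a for a, b in zip(actual, actual[1:])]
    predicted_steps = [b - a for a, b in zip(predicted, predicted[1:])]
    matches = sum(
        1 for da, dp in zip(actual_steps, predicted_steps) if sign(da) == sign(dp)
    )
    return matches / len(actual_steps)


# Hypothetical observed and predicted water cut (fraction) over four periods.
actual_water_cut = [0.30, 0.34, 0.39, 0.45]
predicted_water_cut = [0.31, 0.33, 0.41, 0.52]

error = rmse(actual_water_cut, predicted_water_cut)
fit_rate = history_fit_rate(actual_water_cut, predicted_water_cut)
trend = trend_agreement(actual_water_cut, predicted_water_cut)
```

In this toy case the model reproduces the rising water-cut trend at every step even though its last prediction falls outside the tolerance band, which is the kind of distinction a single RMSE value would hide.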

Behind this smooth workflow lies a series of powerful key technology modules specifically designed for the oil and gas industry.

In-depth Analysis of Key Technical Modules

This chapter provides an in-depth analysis of the two core innovative technology engines that underpin the platform’s value: the Physics-Informed Rule Engine and the KG-AutoLabeler. It also introduces the platform’s powerful multimodal data processing capabilities. These modules form the foundation for ensuring that data meets high-quality standards and that data processing workflows become more intelligent.

Physics-Informed Rule Engine

Core Value: The Physics-Informed Rule Engine is a core module for ensuring data quality and aligning data with the objective laws of geology and reservoir engineering. By embedding domain expertise and physical formulas into the data cleaning process, it fundamentally addresses the pain point of traditional IT-based data cleaning methods, which may mistakenly remove critical business information. This ensures that the data input into models is scientific and trustworthy from the very beginning.

Implementation Mechanism: The engine contains an extensible rule library covering common physical constraints in oil and gas exploration and development. Users can enable these rules through simple selection or configuration.

Range Constraints: Reasonable value ranges are defined for physical parameters. For example, the porosity φ of sandstone typically falls between 0 and 40%: 0 < φ < 40%.

Trend Constraints: Mandatory trends of data changes with respect to a specific variable, such as depth or time, are defined. For instance, the cumulative depth values in the water intake profile of an injection well must increase monotonically as well depth increases.

Mechanism Constraints: Validation is performed based on classical engineering formulas or physical laws. For example, bottom-hole flowing pressure Pwf must be greater than zero: Pwf > 0, and the daily water production of a single well must not exceed its daily liquid production.

User-defined Rules: The platform allows users to define new business validation rules in a simple and intuitive way, similar to Excel formulas. For example:
IF(Col_A > Col_B * 1.5, "Anomaly", "Normal")

This provides business experts with a high degree of flexibility.
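The four rule types above can be sketched as small, composable checks. The record field names and the decision to report violations rather than delete records are assumptions made for illustration; the thresholds come from the examples in the text, and `user_rule` is a direct Python rendering of the Excel-style formula shown above.

```python
def check_range(value, low, high):
    """Range constraint: value must lie strictly between low and high."""
    return low < value < high


def check_monotonic(values, strict=False):
    """Trend constraint: values must not decrease (or must strictly increase)."""
    pairs = list(zip(values, values[1:]))
    if strict:
        return all(b > a for a, b in pairs)
    return all(b >= a for a, b in pairs)


def user_rule(col_a, col_b):
    """Python equivalent of IF(Col_A > Col_B * 1.5, "Anomaly", "Normal")."""
    return "Anomaly" if col_a > col_b * 1.5 else "Normal"


def validate_record(record):
    """Return the names of violated rules instead of silently dropping the record."""
    violations = []
    if not check_range(record["porosity"], 0.0, 0.40):
        violations.append("porosity_range")  # 0 < porosity < 40%
    if not record["pwf"] > 0:
        violations.append("pwf_positive")  # Pwf > 0
    if record["daily_water"] > record["daily_liquid"]:
        violations.append("water_exceeds_liquid")  # water must not exceed liquid
    return violations


# Hypothetical record with two physically implausible values.
record = {"porosity": 0.45, "pwf": 12.0, "daily_water": 30.0, "daily_liquid": 25.0}
violations = validate_record(record)

profile_ok = check_monotonic([0.0, 5.0, 12.0, 12.0])  # cumulative profile by depth
flag = user_rule(16.0, 10.0)
```

Returning rule names instead of deleting rows keeps the final decision with the expert, which matters when a statistical outlier may in fact be a high-pressure zone or fracture belt.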

KG-AutoLabeler

Core Value: This module is designed to fundamentally address the key challenges of traditional data labeling, which relies heavily on expert participation, suffers from low efficiency, and involves high costs. By leveraging the knowledge assets already accumulated within the enterprise, the module enables automated and efficient labeling of large-scale data.

Implementation Principle: Its core principle is to automatically generate labels by using the structured relationships between existing entities and geological events in the knowledge graph. The knowledge graph connects information such as wells, layers, and faults in a graph-based structure, enabling machines to reason in a manner similar to domain experts.

Workflow: Taking lithology labeling of well logging curves as an example:

1. Input: The platform receives a well logging dataset to be labeled, containing well names and depth information.

2. Query: The system automatically queries the knowledge graph to determine whether relevant geological events exist for the corresponding depth interval of the well, such as “well test conclusions” or “mud logging conclusions.”

3. Mapping and Backfilling: If the knowledge graph returns a conclusion such as “oil-bearing layer,” the system automatically backfills the corresponding label into the dataset based on predefined mapping rules. For example, “oil-bearing layer” may be mapped to the numerical label 1, thereby completing the automatic labeling process.
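The three steps above can be sketched with a toy in-memory stand-in for the knowledge graph. The event records, field names, and the mapping of "dry layer" to 0 are hypothetical; only the mapping of "oil-bearing layer" to 1 comes from the text. A production system would issue a graph query instead of scanning a list.

```python
# Toy stand-in for knowledge graph events attached to well depth intervals.
KG_EVENTS = [
    {"well": "W-1", "top_m": 2000.0, "base_m": 2010.0,
     "event": "well test conclusion", "conclusion": "oil-bearing layer"},
    {"well": "W-1", "top_m": 2030.0, "base_m": 2040.0,
     "event": "mud logging conclusion", "conclusion": "dry layer"},
]

# Predefined mapping rules from conclusions to numerical labels.
LABEL_MAP = {"oil-bearing layer": 1, "dry layer": 0}


def query_conclusion(events, well, depth_m):
    """Step 2: find a conclusion covering this well and depth, if any."""
    for event in events:
        if event["well"] == well and event["top_m"] <= depth_m <= event["base_m"]:
            return event["conclusion"]
    return None


def auto_label(samples, events, label_map):
    """Step 3: backfill a label for each sample; None means no evidence was found."""
    labeled = []
    for sample in samples:
        conclusion = query_conclusion(events, sample["well"], sample["depth_m"])
        labeled.append({**sample, "label": label_map.get(conclusion)})
    return labeled


# Step 1: hypothetical well logging samples to be labeled.
samples = [
    {"well": "W-1", "depth_m": 2005.0, "gamma_ray": 48.0},
    {"well": "W-1", "depth_m": 2020.0, "gamma_ray": 95.0},
    {"well": "W-1", "depth_m": 2035.0, "gamma_ray": 110.0},
]
labeled = auto_label(samples, KG_EVENTS, LABEL_MAP)
```

Samples without a matching event keep a `None` label rather than a default class, so they can be routed to an expert or excluded from training instead of being mislabeled.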

Multimodal Unstructured Data Parser

Core Value: The core mission of this module is to activate the massive amount of unstructured and semi-structured data stored in enterprise file systems, such as PDF production daily reports, technical reports, and core or thin-section images in BMP/JPG formats. It converts these data assets into structured features that can be calculated, analyzed, and used by models.

Implementation Approach: For different types of data, the module adopts dedicated parsing pipelines.

1. OCR/NLP Pipeline: For scanned PDF reports, this pipeline converts unstructured textual content into structured key-value information through steps such as image enhancement, table detection, OCR recognition, and entity extraction. For example, it can accurately extract key entities and values such as “oil pressure,” “casing pressure,” and “liquid production” from production logs.

2. Image Feature Extraction: For image data such as core photos and thin-section images, the platform uses deep learning networks pretrained on large-scale image libraries, such as ResNet, to convert each image into a high-dimensional mathematical vector, also known as an embedding. This vector condenses the core visual features of the image, allowing the model to treat the visual texture of rocks as a mathematical feature and thereby identify the intrinsic relationships between visual patterns and production results.
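The final entity extraction step of the OCR/NLP pipeline can be approximated with pattern matching once text has been recognized. The sketch below assumes OCR has already produced plain text; the sample report, the output field names, and the units are hypothetical, and a real pipeline would likely combine table detection and a trained entity model rather than rely on regular expressions alone.

```python
import re

# One pattern per entity: the entity name, an optional ":" or "=", then a number.
FIELD_PATTERNS = {
    "oil_pressure_mpa": r"oil pressure\s*[:=]?\s*(\d+(?:\.\d+)?)",
    "casing_pressure_mpa": r"casing pressure\s*[:=]?\s*(\d+(?:\.\d+)?)",
    "liquid_production_m3d": r"liquid production\s*[:=]?\s*(\d+(?:\.\d+)?)",
}


def extract_fields(ocr_text):
    """Convert recognized report text into structured key-value pairs."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, ocr_text, flags=re.IGNORECASE)
        result[field] = float(match.group(1)) if match else None
    return result


# Hypothetical OCR output from a scanned production daily report.
ocr_text = (
    "Well W-1 daily report\n"
    "Oil pressure: 3.2 MPa  Casing pressure: 5.1 MPa\n"
    "Liquid production = 48 m3/d"
)
structured = extract_fields(ocr_text)
```

Fields that are not found are returned as `None` so that missing values remain visible to downstream cleansing rules instead of being filled silently.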

The implementation of these key technical modules relies on a carefully selected modern technology stack, primarily based on open-source technologies.

Platform Application Value and Future Outlook

The high-quality dataset platform based on knowledge-driven and physics-integrated methodologies, as described in this white paper, is far more than a collection of tools. It represents a methodology and practical framework designed to advance the digital and intelligent transformation of oil and gas exploration and development. By systematically addressing the full-chain bottlenecks from data to value, the platform will bring profound and lasting transformation to enterprises.

The platform’s four core application values can be summarized as follows:

Data Assetization

Through intelligent parsing, cleaning, and integration, the platform transforms raw data in data lakes that are difficult to understand and use into high-value feature assets that are computable, analyzable, and reusable. The resulting enterprise-level feature marketplace enables data to be truly accumulated as a measurable, manageable, and continuously value-enhancing core asset.

Business Efficiency

By introducing disruptive technologies such as automated cleaning based on physics-informed constraints and automated labeling based on knowledge graphs, the platform can significantly shorten the preparation cycle for building a high-quality scenario dataset. What traditionally required several months of manual effort can be reduced to only a few days, greatly accelerating the iteration and deployment of AI applications.

Capability Democratization

Through a low-code/no-code visual modeling interface, the platform empowers frontline geologists and reservoir engineers, who best understand the business, with advanced data science and AI modeling capabilities. This enables them to independently build big data models and solve real production problems in a data-driven manner, thereby maximizing the multiplier effect of expert knowledge and data value.

Scientific Decision-making

By deeply embedding geological and reservoir engineering principles, namely physical mechanisms, into the entire data processing workflow, the platform ensures that the data used for model training conforms to scientific common sense from the source. This not only significantly improves model robustness, interpretability, and prediction accuracy but also provides a solid scientific foundation for model-generated decision recommendations, thereby enhancing the reliability of intelligent decision-making.

Future Outlook

Looking ahead, the platform will evolve beyond a modeling tool toward a more expansive Intelligent Agent Ecosystem. Within this ecosystem:

Each trained model will be encapsulated as a reusable Business Operator, such as a lithology identification operator or a production prediction operator.

Higher-level agents will be able to automatically orchestrate and invoke combinations of these Business Operators based on complex, multi-step geological and reservoir engineering problems. In this way, they can autonomously complete comprehensive analytical tasks, enabling a leap from decision support to autonomous analysis.

At the same time, the platform will actively explore Federated Learning technologies to enable cross-institutional and cross-regional collaborative modeling while ensuring data privacy and security. This will help address the challenge of data silos. In addition, the platform will build a richer visual interaction environment, enabling deep integration between model results and professional maps.

We firmly believe that the development and application of this platform will provide strong momentum for the intelligent upgrading of the oil and gas industry, ushering in a new era driven by data and empowered by knowledge.
