
    Introduction: Toward a Hybrid Enterprise AI Architecture

    At Plain Concepts’ Research team, we’re used to exploring innovative solutions for our clients, and in many of those cases, artificial intelligence plays a key role. Time and again, we’ve seen that its large-scale adoption faces critical challenges when relying exclusively on the cloud, namely: high recurring costs, exposure of sensitive data, network latency, and real scalability limits in organizations with hundreds of users.

    That’s why, over the past few months, we’ve been working with some of our clients on a new hybrid architecture that combines the best of cloud AI with the power of personal devices capable of running AI models locally. This approach leverages distributed compute capacity while reducing operational costs.

    In this architecture, AI applications and models are first published to a cloud hub, which then handles selective deployment: heavy, complex models stay in the cloud for critical cases, while optimized models run locally for day-to-day use. Applications then execute in a distributed way: each local device contributes inference power, removing centralized bottlenecks. Organizations also gain granular control, with full governance over which data, models, and users access each resource.

    Take, for example, a media company like MediaPro, which needs to process large volumes of video and audio for tasks such as transcription, automatic subtitling, or summarization. With a hybrid architecture, the most demanding models (say, advanced semantic analysis) can remain in the cloud and be accessible only to critical roles. Meanwhile, recurring, lighter tasks (like basic transcription or content classification) can run locally, leveraging the hardware acceleration of employees’ laptops. This enables scaling across the entire organization without incurring prohibitive costs or compromising data privacy.

    This architecture is powered by Intel Core™ Ultra laptops with integrated NPUs (Neural Processing Units) and GPUs.

    These laptops with built-in accelerators enable a new hybrid architecture that combines the best of both worlds: the raw power of the cloud for mission-critical workloads, and the efficiency of edge computing for everyday use cases.

    The Role of the NPU and GPU in the New Hybrid Architectures

    This new way of thinking about hybrid computing requires personal devices capable of running AI models locally. To that end, at Plain Concepts we’ve started working with Intel Core™ Ultra laptops, taking advantage of their integrated accelerators: the NPU and the GPU.

    The arrival of Intel Core™ Ultra processors marks a qualitative leap in the ability to run artificial intelligence on personal devices. These processors integrate a heterogeneous architecture that combines three specialized compute engines: CPU, GPU, and NPU, each optimized for different types of AI workloads.

    What Is the NPU and What Is It For?

    The NPU (Neural Processing Unit) is a processor dedicated specifically to the efficient execution of AI models, particularly those that require recurring or background inferences. Its design focuses on maximizing energy efficiency, enabling sustained AI workloads without compromising device battery life. The NPU is ideal for tasks such as:

    • Background natural language processing (translation, subtitling, virtual assistants).
    • Real-time image and video analysis.
    • Continuous personalization and adaptive applications.

    As Intel engineers put it, the NPU is the system’s “marathon runner”: it handles long-duration workloads sustainably, ensuring that a laptop’s battery lasts through a full workday—even in scenarios of intensive AI use.

    Differentiating Roles: NPU vs. GPU vs. CPU

    The architecture of Intel Core Ultra processors distributes AI workloads according to their nature and performance requirements:

    • CPU: Handles fast, low-latency tasks, such as single inferences or control operations. It’s the “100-meter sprinter,” ideal for instant responses and general-purpose processing.
    • GPU: Optimized for AI workloads that require high performance over shorter periods, such as parallel processing of large data volumes (e.g., image generation, large language models). The integrated GPU (based on Intel Arc architecture) is the “hurdle runner,” tackling performance spikes and graphics-intensive workloads.
    • NPU: Specialized in sustained, low-power AI tasks, such as continuous background inference or adaptive personalization. It’s the “marathon runner,” ensuring energy efficiency and extended autonomy.
    CPU | GPU | NPU
    Fast Response | Performance Parallelism & Throughput | Dedicated Low Power AI Engine
    Ideal for lightweight, single-inference, low-latency AI tasks | Ideal for AI-infused Media/3D/Render pipelines | Ideal for sustained AI and AI offload
    P-core & E-core CPU Architecture | Xe2 GPU Architecture | NCEs, Neural Compute Engines
    VNNI & AVX, AI Instructions | XMX, Xe Matrix Extension | Efficiency of matrix compute

     

    Key Benefits

    • Energy efficiency: The NPU enables continuous AI without draining the battery.
    • Flexible scalability: Each engine adapts to specific needs without additional infrastructure.
    • Full compatibility: Native support for frameworks such as OpenVINO, ONNX Runtime, and Hugging Face Optimum Intel.

    Practical Implementation: Key Tools and Frameworks

    Intel Core™ Ultra laptops can run AI models locally using compatible frameworks such as OpenVINO, ONNX Runtime, Hugging Face Optimum Intel, and Azure Foundry Local.

    Azure Foundry Local: Plug-and-Play Solution

    Foundry Local runs language models directly on the client, automatically optimizing for CPU or GPU.

    winget install Microsoft.FoundryLocal
    foundry model run phi-3.5-mini

     

    Foundry Local offers instant installation and transparent management with full privacy, though it’s currently in preview with a limited model catalog and without NPU support.
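
    Once a model is running, Foundry Local exposes an OpenAI-compatible endpoint on the local machine, so existing client code can simply point at it. The sketch below uses the standard openai Python client; the base URL is a placeholder for the address reported by the local service, and the model alias matches the one started above.

    from openai import OpenAI

    # Placeholder endpoint: use the address reported by the local Foundry service
    client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed-locally")

    response = client.chat.completions.create(
        model="phi-3.5-mini",
        messages=[{"role": "user", "content": "Summarize the benefits of on-device AI in one sentence."}],
    )
    print(response.choices[0].message.content)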

    OpenVINO: Direct Access to the NPU

    OpenVINO is the main toolkit for tapping into Intel’s NPU:

    import openvino as ov
    core = ov.Core()
    core.available_devices  # ['CPU', 'GPU', 'NPU']

    Basic NPU inference example:

    from optimum.intel import OVModelForQuestionAnswering
    model = OVModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad-ov")
    model.to("npu")  # Run on NPU for maximum energy efficiency

    Model Compression and Optimization

    Compression is key for efficient execution on local devices. OpenVINO with NNCF enables model reduction with minimal accuracy loss:

    YOLOv8 optimization results:

    • Original model: 100% accuracy, baseline speed
    • Optimized FP16: 99.8% accuracy, 2x faster
    • Compressed INT8: 99.5% accuracy, 4x faster
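
    As a rough sketch of how results like these are produced, NNCF’s post-training quantization needs only the exported model and a small calibration set. The model path, input shape, and random calibration data below are illustrative; in practice you would feed a few hundred real, preprocessed images.

    import numpy as np
    import nncf
    import openvino as ov

    core = ov.Core()
    model = core.read_model("yolov8n.xml")  # hypothetical path to the FP32/FP16 OpenVINO IR

    # Illustrative calibration data; replace with real, preprocessed samples
    calibration_items = [np.random.rand(1, 3, 640, 640).astype(np.float32) for _ in range(300)]
    calibration_dataset = nncf.Dataset(calibration_items, lambda item: item)  # model assumed to have a single input

    # Post-training INT8 quantization
    quantized_model = nncf.quantize(model, calibration_dataset)
    ov.save_model(quantized_model, "yolov8n_int8.xml")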

     

    Performance comparison by device:

    Device | Improvement with optimization | Energy efficiency
    CPU | 2x faster | Standard
    GPU | 4x faster | High
    NPU | 3x faster | Maximum (~13 W vs ~20 W)
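
    Figures like these can be reproduced approximately with a small timing loop that compiles the same model on each available engine. The model path and input shape below are illustrative and carry over from the YOLOv8 example above.

    import time
    import numpy as np
    import openvino as ov

    core = ov.Core()
    dummy_input = np.random.rand(1, 3, 640, 640).astype(np.float32)  # YOLOv8-style input

    for device in ["CPU", "GPU", "NPU"]:
        if device not in core.available_devices:
            continue
        compiled = core.compile_model("yolov8n_int8.xml", device)  # hypothetical quantized model from the step above
        request = compiled.create_infer_request()

        start = time.perf_counter()
        for _ in range(50):
            request.infer({0: dummy_input})
        elapsed = time.perf_counter() - start
        print(f"{device}: {50 / elapsed:.1f} inferences per second")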

     

    Development Ecosystem

    • Hugging Face Optimum Intel: Pre-optimized models for Intel hardware
    • WebNN: Hardware-accelerated AI directly in the browser, no installation required
    • ONNX Runtime: Universal compatibility with existing models

    Use Cases and Practical Examples

    Photorealistic 3D Spaces with Environmental Understanding

    In the Research team, one of the core technologies we develop is Evergine, a graphics engine focused on 3D rendering for industrial applications. A common requirement in our work is integrating AI models into Evergine applications to achieve a deeper understanding of the environment, improving both interaction and the visualization of complex data.

    As a proof of concept to test the capabilities of Intel Core Ultra laptops, we developed a photorealistic 3D environment using Gaussian Splatting to render a living room in real time, while simultaneously integrating an object detection model to identify and classify elements in the scene on the fly.

    As the user navigates the environment, the GPU handles the rendering, while the NPU runs the object detection model in parallel. This enables faster, more energy-efficient identification and classification of objects—improving both runtime performance and power consumption—all without leaving the local device.


    Virtual Assistant with Local RAG

    Intel Core Ultra laptops make it possible to run enterprise-grade virtual assistants with RAG (Retrieval-Augmented Generation) fully on-device, distributing workloads across the different accelerators:

    • LLM (GPU): Generates natural language responses
    • Embeddings (NPU): Vectorizes documents and queries
    • Interface (CPU): Manages the application and preprocessing

    Distributing workloads across the accelerators locally provides higher energy efficiency, since the NPU consumes ~13W compared to ~20W for CPU/GPU.
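
    A minimal sketch of that split, assuming OpenVINO-exported models via Optimum Intel (the model IDs are examples, and the embedding model is reshaped to static dimensions before it can target the NPU):

    from transformers import AutoTokenizer
    from optimum.intel import OVModelForCausalLM, OVModelForFeatureExtraction

    llm_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"        # example local chat model
    embed_id = "sentence-transformers/all-MiniLM-L6-v2"  # example embedding model

    # LLM on the integrated GPU: short bursts of heavy, parallel compute
    llm = OVModelForCausalLM.from_pretrained(llm_id, export=True)
    llm.to("gpu")
    llm_tokenizer = AutoTokenizer.from_pretrained(llm_id)

    # Embeddings on the NPU: frequent, low-power background inference
    embedder = OVModelForFeatureExtraction.from_pretrained(embed_id, export=True)
    embedder.reshape(1, 256)  # the NPU requires static input shapes
    embedder.to("npu")
    embed_tokenizer = AutoTokenizer.from_pretrained(embed_id)

    def embed(text: str):
        # Vectorize a document chunk or user query on the NPU (simple mean pooling)
        tokens = embed_tokenizer(text, return_tensors="pt", padding="max_length",
                                 truncation=True, max_length=256)
        output = embedder(**tokens)
        return output.last_hidden_state.mean(dim=1)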

    Browser-Accelerated AI with WebNN

    WebNN allows AI models to run directly inside web applications using hardware acceleration, with no extra installations. This enables instant web-based deployments, complete privacy (local processing), and automatic optimization based on the available hardware.

    Examples include:

    • Image generation (Stable Diffusion Turbo on GPU)
    • Speech transcription (Whisper on NPU)
    • Image segmentation (Segment Anything)

    Professional Accelerated Editing

    GIMP integrates AI plugins that leverage Intel accelerators for advanced tasks. These plugins let you dynamically select the accelerator (CPU/GPU/NPU) depending on the workload.

    Currently, three plugins are available:

    • Stable Diffusion: Integrated text-to-image generation
    • Super-resolution: Automatic quality enhancement
    • Segmentation: Selective object identification

    Consolidated Business Advantages

    The hybrid AI architecture powered by Intel AI PCs delivers clear, measurable benefits:

    Reduced Operating Costs

    Concept | Cloud Execution | Local Execution (Intel Core Ultra)
    Usage cost | Variable (per token/hour) | Zero (included in the device)
    Response latency | High (network + backend) | Low (on-device)
    Monthly total cost (100 users) | €1,500–3,000 | €0–100 (support/IT)
    Connectivity dependency | Critical | Optional

     

    Scalability and Control

    • Instant scalability: Every laptop adds computing power from day one
    • Granular control: Full management of models, data, and access policies
    • Enhanced security: Local processing with Intel vPro and Threat Detection protection

     

    Energy Sustainability

    • NPU efficiency: ~13W vs. ~20W for CPU/GPU under sustained workloads
    • Lower carbon footprint: Less data transfer and reduced remote processing
    • Extended autonomy: Better power management without sacrificing performance

    Considerations and Limitations

    Current Technical Limitations

    • Dynamic layers on NPU: models must be converted to static shapes, and not all architectures are supported
    • Technical expertise: Optimization with OpenVINO and NNCF requires specialized AI engineering skills
    • Heterogeneous ecosystem: Multiple frameworks complicate integration

     

    Key Recommendations

    • Phased planning: Assess the current device fleet, prioritize critical user profiles, and focus on high-ROI use cases
    • Team training: Build skills in OpenVINO, model optimization, and Intel hardware deployment
    • Reinforced security: Leverage Intel vPro, Threat Detection, and remote management policies


    Final Reflection

    The history of computing is marked by paradigm shifts. The arrival of the personal computer democratized technology. Today, AI is undergoing a similar transition: from the cloud to the endpoint—to the personal device.

    Intel Core Ultra laptops with NPUs represent this shift, delivering:

    • Privacy: Sensitive data stays on the device
    • Cost: Elimination of variable cloud expenses
    • Speed: Local inference without network latency
    • Autonomy: Sustainable AI through NPU-powered energy efficiency

    Companies can now build a distributed, sustainable, and cost-effective AI model by combining the power of the cloud with the efficiency of the edge.

    We are entering a new era of artificial intelligence, where innovation will depend on the ability to run AI where the data is actually generated and used: on the user’s own device.


    Author’s Note: Part of the study and experiments that formed the basis of this article were presented in a technical talk at dotNET 2025 Madrid (pending publication), co-delivered with Ana Escobar (ana.escobar.acunas@intel.com). Many of the videos, demos, and data included in this work would not have been possible without Ana’s invaluable collaboration and expertise, for which I am especially grateful.

    Javier Carnero

    Research Manager at Plain Concepts