Common Spryker Project Issues: Overview

10.2024

Performance | Spryker Modules and Updates | Implementation and Integrations | Infrastructure

Performance

Excessive Data in CustomerTransfer Object

By default, the CustomerTransfer object is filled with the entire customer database record, including sensitive fields such as password hashes. This full object is then stored in the user session.

Results:

– Excessive payload between Yves and Zed leads to slower communication and increased network overhead.

– Unnecessary and unsafe exposure of authentication-related data — password hashes and other internal fields are transmitted to the frontend, creating security risks.

– XSS-related WAF blocking: some password hashes may contain character sequences that trigger WAF rules, causing Backend Gateway (BG) requests from Yves to be blocked and resulting in frontend errors in Yves for affected users.

Solution:

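– Exclude data that the storefront does not need (password hashes and other internal fields) from the CustomerTransfer object before it is stored in the session or passed between Yves and Zed.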
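The exclusion can be approached as an allowlist rather than a blocklist, so that newly added database columns do not leak by default. The following is a minimal, language-neutral sketch in Python, not Spryker code: the field names and the helper `to_session_customer` are hypothetical, and in a real project the same idea would be applied in PHP wherever the CustomerTransfer is populated.

```python
from typing import Any, Mapping

# Hypothetical allowlist: only the fields the storefront actually needs.
SESSION_SAFE_FIELDS = frozenset(
    {"id_customer", "customer_reference", "email", "first_name", "last_name"}
)


def to_session_customer(customer_record: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy of the record that contains only allowlisted fields."""
    return {
        key: value
        for key, value in customer_record.items()
        if key in SESSION_SAFE_FIELDS
    }


# Illustrative record; the hash value is made up.
record = {
    "id_customer": 42,
    "email": "jane@example.com",
    "first_name": "Jane",
    "password": "$2y$10$abc../def",      # hash: must never reach the session
    "restore_password_key": "tmp-key",   # internal field
}
session_customer = to_session_customer(record)
```

With an allowlist, the sensitive `password` and `restore_password_key` entries are dropped without having to be named explicitly.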

Overloaded QuoteTransfer Object

Excessive enrichment of the QuoteTransfer object with unnecessary data (for example, 20 MB of delivery-related information) can significantly degrade checkout performance. This is a common architectural issue observed in custom checkout implementations.

Results:

– Severe slowdown of the checkout process due to large payloads being passed between Yves and Zed.

– Increased memory consumption and latency during quote recalculations.

– Higher risk of timeouts and degraded user experience during order placement.

Solution:

– Exclude non-essential data from the QuoteTransfer.

– Refactor business logic in Zed to calculate delivery details on demand (during the checkout process), rather than embedding them in the QuoteTransfer.
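The on-demand approach can be illustrated with a small Python sketch. It assumes a slimmed-down quote that carries only identifiers, plus a lookup that runs when the shipment step actually needs delivery details; `Quote`, `find_delivery_slots`, and `render_shipment_step` are hypothetical names, and the lookup body is a stand-in for real business logic in Zed.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Quote:
    """Hypothetical slim quote: carries identifiers, not delivery data."""
    quote_id: str
    postal_code: str
    shipment_method_id: int


@lru_cache(maxsize=1024)
def find_delivery_slots(postal_code: str, shipment_method_id: int) -> tuple[str, ...]:
    """Stand-in for an expensive lookup, called only when a step needs it."""
    return tuple(f"{postal_code}-{shipment_method_id}-slot{i}" for i in range(3))


def render_shipment_step(quote: Quote) -> dict:
    # Delivery details are resolved here and never written back to the quote,
    # so the object passed between applications stays small.
    return {
        "quote_id": quote.quote_id,
        "slots": find_delivery_slots(quote.postal_code, quote.shipment_method_id),
    }
```

The cache is optional; it only avoids recomputing the same lookup when the step is rendered more than once.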

Slow Rendering of Catalog Pages Due to Filter Block Complexity

Excessively nested components (molecules, organisms) in the filter block significantly increase Twig rendering overhead. In addition, some category filters are rendered in full even when they are hidden via CSS and never displayed to the user, which wastes processing. As a result, catalog pages experience significant server-side rendering delays under certain conditions.

Results:

– High resource consumption during Twig rendering on catalog pages.

– Inconsistent but recurring slow response times, especially for large filter trees.

– Poor maintainability and diagnosability due to a deeply composed page structure.

Solution:

– Implement caching for the filter block to avoid repeated rendering.

– Remove unused and hidden filters.

– Shift filter block rendering to occur after the product list is displayed, reducing perceived load time and preventing blank screen delay for users.

– Flatten overly nested Atomic Frontend structures in complex filter cases.
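Caching the filter block depends on a key that covers everything that changes its output. The sketch below is a generic Python illustration under assumed inputs (category, locale, and active filters); the in-memory dictionary and the function names are hypothetical, and a real project would use its existing cache storage and add invalidation.

```python
import hashlib
import json
from typing import Callable

# Hypothetical in-memory cache; a shared store would be used in practice.
_render_cache: dict[str, str] = {}


def cache_key(category_id: int, locale: str, active_filters: dict) -> str:
    """Build a stable key from everything that changes the rendered block."""
    payload = json.dumps([category_id, locale, active_filters], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def render_filter_block(
    category_id: int,
    locale: str,
    active_filters: dict,
    render: Callable[[], str],
) -> str:
    """Return the cached markup, calling the expensive renderer only on a miss."""
    key = cache_key(category_id, locale, active_filters)
    if key not in _render_cache:
        _render_cache[key] = render()
    return _render_cache[key]
```

Sorting the keys during serialization means that the same filters selected in a different order map to the same cache entry.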

Availability Notification Email Overload

The default implementation of back-in-stock notification emails is not designed to handle the high volume of subscriptions that bots can generate. Email-sending tasks are consumed from the queue by a single-threaded worker. In the observed case, the task exhausted all available resources, frequently failed with an exception, and was requeued. This produced an infinite retry loop in which emails were never actually delivered. In addition, queue exceptions were silently caught and not reported to New Relic or other monitoring systems.

Results:

– Infinite retry loops in the message queue, causing resource waste and performance issues.

– Critical errors in asynchronous flows go unnoticed due to missing observability.

Solution:

– Implement asynchronous, multi-threaded, or batched email sending to improve throughput and resilience.
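One way to combine batching with protection against endless requeueing is sketched below in Python. It is an illustration under assumptions, not the Spryker queue API: `send_batch` and `report_error` are hypothetical callbacks, and the retry limit of 3 is an arbitrary example value.

```python
from typing import Callable, Iterator

MAX_ATTEMPTS = 3  # assumption: after this many failures a batch is parked


def batches(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_notifications(
    messages: list,
    send_batch: Callable[[list], None],
    report_error: Callable[[Exception, int], None],
    batch_size: int = 100,
) -> list:
    """Send in batches; park a batch after MAX_ATTEMPTS instead of requeueing forever."""
    dead_letter = []
    for batch in batches(messages, batch_size):
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                send_batch(batch)
                break
            except Exception as exc:
                # Surface every failure to monitoring instead of swallowing it.
                report_error(exc, attempt)
        else:
            # All attempts failed: keep the batch for inspection.
            dead_letter.extend(batch)
    return dead_letter
```

The returned list plays the role of a dead-letter queue: failed messages stay visible without blocking the remaining batches.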

Multiple Checkout Form Submissions

Certain frontend errors can break the JavaScript that prevents multiple form submissions. Users can then click the submit button on a checkout step repeatedly, triggering multiple identical requests to the server. This duplicates processing on the backend and significantly slows down the checkout flow, because each request has to complete before the next step can be loaded.

Results:

– Multiple identical requests per checkout step due to repeated form submissions.

– Increased load on backend systems and degraded performance of the checkout process.

– Poor user experience caused by delays and confusion at critical checkout steps.

Solution:

– Disable submit buttons immediately at the beginning of the event handler, before any asynchronous logic or validations.

– Improve frontend resilience by isolating critical scripts from non-critical UI logic to prevent global JS failure from affecting form submission logic.

– Add backend-side safeguards to detect and reject duplicate submissions.
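The backend-side safeguard can be as simple as accepting each submission token once. The Python sketch below assumes that the form carries a one-time token per render; `SubmissionGuard` is a hypothetical class, and the in-memory set stands in for shared storage, which a multi-instance setup would need.

```python
import threading


class SubmissionGuard:
    """Accepts each (session, step, token) combination once; duplicates are rejected."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str, str]] = set()
        self._lock = threading.Lock()

    def try_accept(self, session_id: str, step: str, token: str) -> bool:
        key = (session_id, step, token)
        # Check and insert under one lock so that two concurrent requests
        # with the same token cannot both be accepted.
        with self._lock:
            if key in self._seen:
                return False
            self._seen.add(key)
            return True


guard = SubmissionGuard()
first_click = guard.try_accept("session-1", "payment", "token-abc")
second_click = guard.try_accept("session-1", "payment", "token-abc")
```

The first request is processed, and the repeated click is rejected before any checkout logic runs.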

Synchronous Reservation in Initial OMS State

Placing the «reserved» flag on the initial OMS state (typically «new») causes the reservation process to be executed synchronously during order creation. Despite being documented and covered in training, this pattern is still encountered in projects. It leads to performance degradation, as the customer has to wait for the inventory reservation to complete before the next checkout step loads.

Results:

– Inventory reservation runs as a blocking operation during order placement.

– Significant delay in the customer-facing flow after submitting an order.

Solution:

– Remove the «reserved» flag from the initial OMS state and introduce a dedicated intermediate state that carries the «reserved» flag, so that reservation is offloaded to an asynchronous process.
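This pattern can be detected automatically, for example in CI. The Python sketch below scans a state machine definition for states marked as reserved and checks whether the initial state is among them. The XML samples are simplified and hypothetical, and the initial state name "new" is passed in as an assumption rather than read from project configuration.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{ns}state' down to 'state'."""
    return tag.rsplit("}", 1)[-1]


def reserved_states(oms_xml: str) -> set[str]:
    """Collect the names of all states that carry reserved="true"."""
    root = ET.fromstring(oms_xml)
    return {
        element.get("name", "")
        for element in root.iter()
        if local_name(element.tag) == "state" and element.get("reserved") == "true"
    }


def initial_state_is_reserved(oms_xml: str, initial_state: str = "new") -> bool:
    return initial_state in reserved_states(oms_xml)


# Simplified, hypothetical process definitions for illustration only.
PROBLEMATIC = """
<statemachine><process name="Example"><states>
  <state name="new" reserved="true"/>
  <state name="paid" reserved="true"/>
</states></process></statemachine>
"""
FIXED = """
<statemachine><process name="Example"><states>
  <state name="new"/>
  <state name="reservation pending" reserved="true"/>
  <state name="paid" reserved="true"/>
</states></process></statemachine>
"""
```

Here `initial_state_is_reserved(PROBLEMATIC)` flags the first definition, while the second one, with a dedicated intermediate state, passes.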

Excessive S3 Calls per Order Item Preview

Excessive and repeated interactions with Amazon S3 from PHP when rendering order history pages can significantly slow down page load times. A typical case involves loading document previews (e.g., images, model renders, invoices) for each order item. For security reasons, a private S3 bucket is often used along with presigned URLs. As a result, each preview requires two S3 interactions: one to generate a signed URL, and one to fetch the file. This leads to high latency and potential preview loading failures due to PHP session locks, which prevent the concurrent generation of multiple previews. The issue is difficult to detect, but it often affects high-value customers with large, complex orders.

Results:

– Slow rendering of order history pages due to multiple S3 round-trips.

– High resource usage and latency from synchronous presigned URL generation.

– Preview load failures in the browser due to locked sessions during concurrent S3 access.

– Negative impact on key customer segments with the most expensive and data-heavy orders.

Solution:

– Move static previews of non-vital assets to a public S3 bucket, store file paths in the database, and let the browser load previews directly from S3.

– Load vital previews from private S3 only after the page is rendered using asynchronous requests.

Spryker Modules and Updates

Missing Null Checks in Certain Versions of Spryker Modules

Some versions of Spryker core and shop modules contain unsafe data type assumptions in their code, such as treating nullable return values as guaranteed objects or strings. While some of these issues may have been fixed in later releases, they are still frequently encountered in real-world projects that have not yet upgraded to the latest versions.
A common case is found in spryker-shop/customer-page, where the controller retrieves the current customer and checks for the presence of an ID to decide on logout. However, the method used to fetch the customer may return null, which leads to unhandled errors and application crashes.
A similar case is found in spryker-shop/agent-page (one of the commits in the 1.16.0 release), where the findAgentUsername function may return null while the findAgentUserByUsername function expects a string, which results in a crash during project updates.
We maintain a checklist of known unsafe data type assumptions for various versions of Spryker modules.

Results:

– Runtime errors due to null dereferencing.

Solution:

– Check which module versions are used in the project and verify whether those versions contain known null-handling issues or other unsafe data type assumptions.

– Refactor affected modules.

– When upgrading core or shop modules, proactively review release changes and new code for similar unsafe data type assumptions.

– Enforce stricter type checks using PHPStan to detect nullable return values and enforce explicit validation.
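The unsafe assumption and its fix follow the same shape in any language. The Python sketch below mirrors the agent-page case with hypothetical stand-ins for findAgentUsername and findAgentUserByUsername; it does not reproduce the actual Spryker code, only the pattern of guarding a nullable return value before passing it on.

```python
from typing import Optional

# Hypothetical user store for the illustration.
AGENTS = {"agent@example.com": {"username": "agent@example.com"}}


def find_agent_username(session: dict) -> Optional[str]:
    """May return None when no agent is logged in."""
    return session.get("agent_username")


def find_agent_user_by_username(username: str) -> Optional[dict]:
    """Expects a string; passing None here would raise AttributeError."""
    return AGENTS.get(username.lower())


def find_current_agent(session: dict) -> Optional[dict]:
    username = find_agent_username(session)
    if username is None:
        # Explicit guard: the nullable value never reaches the lookup.
        return None
    return find_agent_user_by_username(username)
```

A static analyzer configured for strict optional checks reports the unguarded variant, which is the role the text assigns to PHPStan in PHP projects.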

Breaking Changes in the CompanyBusinessUnit Module Affecting Address Updates

A regression introduced in spryker/company-business-unit version 2.16.0 caused a critical issue with customer address persistence. The update replaced usage of the create() method, which previously handled both creation and update, with a new update() method. However, the initial implementation of update() contained a logic error that resulted in the loss of all address associations for a customer.
As a result, when a user updated their profile, all linked addresses were removed from the /customer/address page and disappeared from the address selection step during checkout.
Although this issue was later resolved in version 2.17.2 (released on Sep 9), it remained unpatched for nearly six months and several intermediate versions, and may still be present in many live projects that are running on versions prior to 2.17.2, where upgrades are delayed or selectively applied.

Results:

– Updating the customer profile triggers the removal of all associated addresses.

– The address step in checkout becomes empty, preventing order placement.

Solution:

– Check the version of spryker/company-business-unit used in the project and verify whether it is affected (2.16.0 or later, but earlier than 2.17.2).

– If affected, either upgrade to 2.17.2+ or manually apply the patch for address updates.
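The affected range is half-open: 2.16.0 is included and 2.17.2 is not. A minimal Python sketch of that check follows; it assumes plain numeric versions such as "2.16.3" (optionally prefixed with "v") and does not handle pre-release suffixes.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '2.16.3' or 'v2.16.3' into (2, 16, 3)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_affected(version: str) -> bool:
    """True for versions from 2.16.0 up to, but not including, 2.17.2."""
    return (2, 16, 0) <= parse_version(version) < (2, 17, 2)
```

Comparing tuples of integers avoids the pitfall of string comparison, where "2.9.0" would sort after "2.16.0".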

Twig Version Regression Causing Severe Rendering Slowdowns in Certain Spryker Setups

A performance regression introduced in Twig version 3.9.0 (released on April 16, 2024) caused a significant slowdown in template rendering. Although the issue originates from Twig itself and not from Spryker, it critically affects Spryker-based projects due to the complexity and depth of Twig usage in the Atomic Frontend structure.
The slowdown may go unnoticed at first and is difficult to trace, especially in projects relying heavily on component-driven rendering in Yves. Spryker resolved compatibility with the affected Twig release in spryker/twig module version 3.23.0 (released July 10, 2024), which prevents the installation of the problematic Twig version. However, any project using older versions of spryker/twig (prior to 3.23.0) remains vulnerable if Twig was upgraded independently.

Results:

– Severe degradation of page rendering performance (especially on complex pages like category or checkout).

– Difficult troubleshooting due to the external origin of the issue and unclear correlation with frontend slowness.

Solution:

– Check the project’s installed version of Twig and ensure it is not 3.9.0.

– If Twig 3.9.0 is in use, downgrade to the previous stable version (3.8.x).

– Confirm that spryker/twig is updated to at least version 3.23.0, which blocks incompatible Twig versions via dependency constraints.

– If the project is running an older version of spryker/twig, carefully review Composer dependency updates to avoid introducing the problematic Twig version.
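The installed Twig version can be read from composer.lock, which lists resolved packages under "packages" and "packages-dev". The Python sketch below is a hypothetical helper that takes the lock file content as text; the sample content is made up for illustration.

```python
import json
from typing import Optional


def installed_version(composer_lock_text: str, package: str) -> Optional[str]:
    """Return the locked version of a package without a leading 'v', or None."""
    lock = json.loads(composer_lock_text)
    for section in ("packages", "packages-dev"):
        for entry in lock.get(section, []):
            if entry.get("name") == package:
                return entry.get("version", "").lstrip("v")
    return None


def twig_is_problematic(composer_lock_text: str) -> bool:
    """True when the lock file pins the Twig release named in the text."""
    return installed_version(composer_lock_text, "twig/twig") == "3.9.0"


# Made-up lock file content for illustration.
SAMPLE_LOCK = json.dumps({
    "packages": [
        {"name": "spryker/twig", "version": "3.22.0"},
        {"name": "twig/twig", "version": "v3.9.0"},
    ],
    "packages-dev": [],
})
```

For `SAMPLE_LOCK`, `twig_is_problematic` reports the pinned release, and `installed_version(SAMPLE_LOCK, "spryker/twig")` shows that the module is older than 3.23.0.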

Implementation and Integrations

Undetectable Runtime Errors in Pyz Code Due to Invalid Factory and Dependency Configuration

In Spryker-based projects, the majority of application-specific logic and customizations reside in the Pyz namespace, including custom Factory and DependencyProvider configurations. These areas are flexible by design, but that flexibility introduces a risk: it is possible to write code that passes all static analysis tools (PHPStan, etc.) yet fails at runtime due to incorrect service wiring or broken dependencies.
This is a widespread issue because projects often contain legacy, unreachable, or untested code paths left behind after upgrades or refactoring. Such errors typically remain unnoticed in staging and production environments unless the affected code is triggered by a revert, a deep dependency call, or user action. The problem is not limited to new development: even reverted code or accidental overrides can silently introduce runtime failures.

Results:

– Runtime errors in service resolution via factories or dependency providers that go undetected during CI.

– Reverted or outdated classes reintroduce previously fixed issues.

– Broken but unused code exists in the application, lowering maintainability and increasing technical debt.

Solution:

– Implement project-specific validation tests that systematically instantiate all custom Factory classes under Pyz\* and call their create*() and get*() methods using reflection or a factory resolver.

– Ensure overridden Spryker methods in Pyz are also tested, as Pyz may break their dependencies even if the original Spryker implementation is unchanged.

– Treat custom DI wiring in DependencyProvider and Factory as critical code that requires runtime test coverage beyond what static analysis provides.

– Include these factory-level runtime checks in CI as part of pre-deployment safety validation.
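The idea behind such factory-level checks is shown below as a Python sketch that uses reflection to call every argument-free create*/get* method and collect the failures. It illustrates the technique only: `smoke_test_factory` and `BrokenFactory` are hypothetical, and a Spryker project would implement the equivalent in PHP against its own Factory classes.

```python
import inspect


def needs_arguments(method) -> bool:
    """True if the method has at least one required parameter."""
    required_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return any(
        p.default is inspect.Parameter.empty and p.kind in required_kinds
        for p in inspect.signature(method).parameters.values()
    )


def smoke_test_factory(factory) -> dict[str, Exception]:
    """Call each argument-free create*/get* method and collect what fails."""
    failures: dict[str, Exception] = {}
    for name, method in inspect.getmembers(factory, predicate=inspect.ismethod):
        if not name.startswith(("create", "get")) or needs_arguments(method):
            continue  # methods with required arguments are out of scope here
        try:
            method()
        except Exception as exc:
            failures[name] = exc
    return failures


class BrokenFactory:
    """Hypothetical factory with one dependency that was never registered."""

    def __init__(self, container: dict) -> None:
        self._container = container

    def create_price_calculator(self) -> object:
        return object()

    def get_tax_client(self) -> object:
        return self._container["TAX_CLIENT"]  # raises KeyError at runtime


failures = smoke_test_factory(BrokenFactory({}))
```

The broken wiring in `get_tax_client` is valid code, so it only shows up when the method is actually executed, which is what the smoke test does.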

Incomplete PSP API Integration Leading to Payments Without Orders

Partial or incorrect implementation of the Payment Service Provider API integration can result in cases where payments are authorized or even captured, but no corresponding order is created in the system. This typically happens due to a lack of transactional consistency between payment authorization and order placement logic.

Results:

– Payments processed without matching orders in the database.

– Financial discrepancies that require manual reconciliation.

Solution:

– Prevent CapturePayment from being triggered without an explicit request from the shop backend.

– For payment methods that do not support authorization/capture separation (instant capture), trigger the payment only after the order has been successfully created to prevent irreversible transactions without associated orders.

– Log the entire communication flow with the PSP, including successful and failed steps. Ensure that, in case of order placement failure, payment attempt metadata and status are persistently stored in the database.
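For instant-capture payment methods, the ordering and logging rules above can be sketched as follows. This Python example is an illustration under assumptions: `create_order` and `capture_payment` are hypothetical callbacks, and the list used as `attempt_log` stands in for a persistent database table.

```python
from typing import Callable, Optional


def checkout(
    create_order: Callable[[], str],
    capture_payment: Callable[[str], None],
    attempt_log: list,
) -> Optional[str]:
    """Create the order first; trigger the irreversible capture only afterwards."""
    try:
        order_id = create_order()
    except Exception as exc:
        attempt_log.append(
            {"step": "create_order", "status": "failed", "error": str(exc)}
        )
        return None  # no order exists, so the payment is never triggered

    attempt_log.append(
        {"step": "create_order", "status": "ok", "order_id": order_id}
    )
    try:
        capture_payment(order_id)
    except Exception as exc:
        # The order exists, so the failed payment can be retried or cancelled.
        attempt_log.append(
            {"step": "capture", "status": "failed",
             "order_id": order_id, "error": str(exc)}
        )
        return order_id

    attempt_log.append({"step": "capture", "status": "ok", "order_id": order_id})
    return order_id
```

With this ordering, a failure can leave an order without a payment, which is recoverable, but never a payment without an order.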

Incomplete PSP Integration with Inconsistent OMS and PSP State Machines

Misalignment between the Spryker OMS state machine and the state logic expected by the Payment Service Provider can lead to broken payment flows. In particular, simultaneous use of auto-capture settings and a separate CapturePayment OMS state can result in orders being blocked in an undefined or incomplete status — authorized but not captured, due to missing or misfired transitions.

Results:

– Orders are stuck in intermediate states without successful payment capture.

– Operational overhead from manually recovering blocked transactions.

Solution:

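– Align OMS logic with the behavior and configuration of the integrated PSP. In particular, use either the PSP's auto-capture setting or a separate CapturePayment OMS state, not both at the same time.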

Infrastructure

WAF Blocking Internal Traffic Between Yves and Zed

In certain environments, WAF policies are applied not only to external client traffic (browser→Yves), but also to internal requests between Yves and Zed. This creates false positives and leads to request failures when enriched or user-originated data, already sanitized upstream, is blocked a second time.
A typical example is the transfer of CustomerTransfer objects from Yves to Zed. For example, a password hash in the payload may contain a “../” substring, triggering a WAF rule intended to block path traversal attempts.
Another case occurred in Agent Mode, where XSS protection rules blocked the selection of a customer whose company name included special characters, for example, “CAF’&THÉ”. This again caused a WAF false positive on the Yves→Zed request, and the blocking occurred silently: no error was logged in New Relic or in backend logs, which complicated detection and diagnosis.
In both cases, the problem was not with malicious input, but with legitimate data originating from the database or user input that had already passed WAF filtering at the browser→Yves boundary. Applying WAF rules again on Yves→Zed traffic duplicates filtering effort and introduces fragility without meaningful security gain.

Results:

– Internal requests blocked due to WAF false positives on harmless substrings (e.g., “../”, “&”, “"”).

– Legitimate user flows break (e.g., agent cannot select a customer, checkout fails silently).

– Lack of observability: blocked requests never reach the backend and are not reported to monitoring tools.

Solution:

– Review infrastructure to determine whether WAF filtering on internal Yves→Zed traffic is necessary.

– If WAF filtering is required, apply a separate and more relaxed WAF rule set to internal traffic.

– If neither of the options above is feasible, validate data before passing transfer objects between Yves and Zed, so that values likely to trigger WAF rules are detected on the application side.

– Ensure observability for blocked requests. WAF responses on internal routes should be logged and monitored.
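Application-side validation can start as a diagnostic that reports which fields of an outgoing payload contain substrings known to trigger the WAF. The Python sketch below is hypothetical: the substring list is taken from the examples in this section and would need to be derived from the actual WAF rule set.

```python
# Illustrative only: real values depend on the WAF rules in use.
SUSPICIOUS_SUBSTRINGS = ("../", "&", '"')


def find_waf_risky_fields(payload, path: str = "") -> list[str]:
    """Return the paths of all string values that contain a suspicious substring."""
    hits: list[str] = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            child = f"{path}.{key}" if path else str(key)
            hits.extend(find_waf_risky_fields(value, child))
    elif isinstance(payload, list):
        for index, value in enumerate(payload):
            hits.extend(find_waf_risky_fields(value, f"{path}[{index}]"))
    elif isinstance(payload, str):
        if any(fragment in payload for fragment in SUSPICIOUS_SUBSTRINGS):
            hits.append(path)
    return hits


# Made-up payload resembling the cases described above.
payload = {
    "customer": {
        "email": "jane@example.com",
        "password": "$2y$10$ab../cd",
        "company": "CAF'&THE",
    },
    "notes": ["ok", "x&y"],
}
risky = find_waf_risky_fields(payload)
```

Logging the reported paths before a request is sent gives the application its own record of likely WAF blocks, even when the blocked request never reaches the backend.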

Autoscaling Group Misconfiguration Causing RabbitMQ Instability

Misconfigurations in autoscaling groups can lead to unintended termination and restarts of RabbitMQ instances in the managed Spryker infrastructure. While AWS may automatically relaunch the instance, systems that rely on persistent connections, such as Jenkins, can hang when RabbitMQ becomes temporarily unavailable.
This issue often goes unnoticed initially. The storefront continues to operate normally, but background processes silently stall: order statuses no longer change, queues are not processed, and asynchronous workflows stop functioning. Jenkins tasks that depend on RabbitMQ may hang indefinitely, effectively blocking the pipeline.

Results:

– RabbitMQ becomes intermittently unavailable due to autoscaling group misconfigurations.

– Jenkins jobs hang, blocking background processing.

– OMS and event-driven features silently fail, leading to stuck orders and missing transitions.

– No proactive alerts or monitoring signals indicate the failure, delaying detection.

Solution:

– Collect the specific symptoms that indicate an infrastructure-level configuration error (for example, RabbitMQ instance restarts, hanging Jenkins jobs, and stalled queues) and point Spryker support to them. Expect an extended investigation: Spryker support may initially report that the system is functioning as intended, which can significantly delay root cause identification and resolution.