Common Spryker Project Issues Overview
10.2024
Below is a list of common Spryker project issues frequently encountered by AXON21’s team across projects.
Some of these issues may have been addressed in recent platform updates; however, since upgrades are often postponed or difficult to apply in many projects, these issues continue to surface and are effectively resolved only through a structured monitoring and maintenance process.
Wider elimination of these issues across the Spryker stack can be achieved by reducing the technical debt that prevents projects from updating to the latest platform versions.
The list is divided into four logical groups based on how these issues impact different aspects of project maintenance.
Excessive Data in CustomerTransfer object
By default, the CustomerTransfer object is filled with the entire customer database record, including sensitive fields such as password hashes. This full object is then stored in the user session.
Results:
– Excessive payload between Yves and Zed leads to slower communication and increased network overhead.
– Unnecessary and unsafe exposure of authentication-related data — password hashes and other internal fields are transmitted to the frontend, creating security risks.
– XSS-related WAF blocking: some password hashes contain character sequences that trigger WAF rules, so Backend Gateway (BG) requests are blocked and affected users see frontend errors in Yves.
Solution:
– Exclude excessive data from the CustomerTransfer object.
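One way to apply this is an allowlist that runs before customer data is written to the session. The sketch below uses Python only for brevity; the field names are hypothetical, and in a Spryker project the same idea would live in project-level PHP code.

```python
# Hypothetical field names; adjust to the project's actual customer schema.
SESSION_SAFE_FIELDS = frozenset({
    "id_customer",
    "customer_reference",
    "email",
    "first_name",
    "last_name",
    "locale",
})


def to_session_customer(customer_record: dict) -> dict:
    """Return a copy that holds only allowlisted fields."""
    return {
        name: value
        for name, value in customer_record.items()
        if name in SESSION_SAFE_FIELDS
    }
```

An allowlist is preferred here over a blocklist because newly added database columns then stay out of the session by default.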
Overloaded QuoteTransfer object
Excessive enrichment of the QuoteTransfer object with unnecessary data — for example, 20MB of delivery-related information — can significantly degrade checkout performance. This is a common architectural issue observed in custom checkout implementations.
Results:
– Severe slowdown of the checkout process due to large payloads being passed between Yves and Zed.
– Increased memory consumption and latency during quote recalculations.
– Higher risk of timeouts and degraded user experience during order placement.
Solution:
– Exclude non-essential data from the QuoteTransfer.
– Refactor business logic in Zed to calculate delivery details on demand (during the checkout process), rather than embedding them in the QuoteTransfer.
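Before excluding data, it helps to know which fields dominate the payload. The following diagnostic is a minimal sketch, assuming the quote can be dumped to a JSON-compatible dictionary; the function name and the sample field names are illustrative only.

```python
import json


def payload_size_by_field(transfer: dict) -> list[tuple[str, int]]:
    """Serialized size in bytes of each top-level field, largest first."""
    sizes = [
        (name, len(json.dumps(value).encode("utf-8")))
        for name, value in transfer.items()
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


# Example with a hypothetical quote dump.
quote = {"items": [{"sku": "A"}], "delivery_options": ["x" * 1000]}
for field, size in payload_size_by_field(quote):
    print(f"{field}: {size} bytes")
```

Fields that turn out to be large and are only needed at one checkout step are candidates for on-demand calculation in Zed.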
Slow Rendering of Catalog Pages Due to Filter Block Complexity
Excessively nested components (molecules, organisms) in the filter block significantly increase Twig rendering overhead. Additionally, some category filters are rendered in full even when they are hidden via CSS and never displayed to the user, which wastes processing. As a result, catalog pages suffer significant server-side rendering delays under certain conditions.
Results:
– High resource consumption during Twig rendering on catalog pages.
– Inconsistent but recurring slow response times, especially for large filter trees.
– Poor maintainability and diagnosability due to a deeply composed page structure.
Solution:
– Implement caching for the filter block to avoid repeated rendering.
– Remove unused and hidden filters.
– Shift filter block rendering to occur after the product list is displayed, reducing perceived load time and preventing blank screen delay for users.
– Flatten overly nested Atomic Frontend structures in complex filter cases.
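The caching idea can be sketched as a render cache keyed by everything that influences the filter block output. This is an illustration in Python with hypothetical names; the in-process dictionary stands in for whatever shared cache the project uses, and expiry and invalidation are deliberately left out.

```python
import hashlib
import json
from typing import Callable

_render_cache: dict[str, str] = {}


def cache_key(category_id: int, locale: str, active_filters: dict) -> str:
    # sort_keys makes the key independent of the order in which filters were set.
    raw = json.dumps([category_id, locale, active_filters], sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def render_filter_block(
    category_id: int,
    locale: str,
    active_filters: dict,
    render: Callable[[], str],
) -> str:
    """Return cached HTML, calling the expensive render() only on a miss."""
    key = cache_key(category_id, locale, active_filters)
    if key not in _render_cache:
        _render_cache[key] = render()
    return _render_cache[key]
```

The key must include every input that changes the markup (category, locale, active filters, and possibly customer group or price mode); otherwise users may receive a block rendered for a different context.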
Availability Notification Email Overload
The default implementation of back-in-stock notification emails is not prepared to handle high volumes of subscriptions generated by bots. Email sending tasks are consumed from the queue by a single-threaded worker. When the worker exhausts all available resources, the task fails with an exception and is requeued. This results in an infinite retry loop in which emails are never actually delivered. Additionally, queue exceptions are silently caught and not reported to New Relic or other monitoring systems.
Results:
– Infinite retry loops in the message queue, causing resource waste and performance issues.
– Critical errors in asynchronous flows go unnoticed due to missing observability.
Solution:
– Implement asynchronous, multi-threaded, or batched email sending to improve throughput and resilience.
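A batched sender can be sketched as follows. The bounded retry count and the dead-letter list are assumptions added for illustration, since they address the infinite requeue loop described above; `send_batch` is a hypothetical callable standing in for the project's mail adapter.

```python
import logging
from typing import Callable, Iterable, Iterator

logger = logging.getLogger("availability_notification")


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def send_in_batches(
    recipients: Iterable,
    send_batch: Callable[[list], None],
    batch_size: int = 100,
    max_attempts: int = 3,
) -> list:
    """Send in batches; return recipients that still failed after max_attempts."""
    dead_letter = []
    for batch in batched(recipients, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                send_batch(batch)
                break
            except Exception:
                # Report the failure instead of swallowing it silently.
                logger.exception("batch failed (attempt %d/%d)", attempt, max_attempts)
        else:
            # All attempts failed: park the batch rather than requeueing forever.
            dead_letter.extend(batch)
    return dead_letter
```

Routing the logger to the monitoring system in use makes repeated failures visible, and the returned list can be stored for manual or delayed reprocessing.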
Multiple Checkout Form Submissions
Some frontend errors can break the JavaScript responsible for preventing multiple form submissions. Users can then click the submit button on a checkout step repeatedly, triggering multiple identical requests to the server. This duplicates processing on the backend and significantly slows down the checkout flow, because each request has to complete before the next step can load.
Results:
– Multiple identical requests per checkout step due to repeated form submissions.
– Increased load on backend systems and degraded performance of the checkout process.
– Poor user experience caused by delays and confusion at critical checkout steps.
Solution:
– Disable submit buttons immediately at the beginning of the event handler, before any asynchronous logic or validations.
– Improve frontend resilience by isolating critical scripts from non-critical UI logic to prevent global JS failure from affecting form submission logic.
– Add backend-side safeguards to detect and reject duplicate submissions.
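The backend-side safeguard can be sketched as a short-lived token check: the first request for a given token is processed and repeats within a time window are rejected. This is a minimal in-memory illustration with hypothetical names; a real deployment would need a store shared by all application processes with an atomic set-if-absent operation.

```python
import time
from typing import Callable


class SubmissionGuard:
    """Rejects a repeated submission of the same token within ttl_seconds."""

    def __init__(self, ttl_seconds: float = 30.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: dict[str, float] = {}

    def try_acquire(self, token: str) -> bool:
        now = self._clock()
        # Drop expired entries so the map does not grow without bound.
        self._seen = {t: ts for t, ts in self._seen.items() if now - ts < self._ttl}
        if token in self._seen:
            return False  # duplicate: the caller should reject or ignore the request
        self._seen[token] = now
        return True
```

A token could be derived, for example, from the session ID plus the checkout step name, so that a legitimate later submission of a different step is not blocked.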
Synchronous Reservation in Initial OMS State
Placing the «reserved» flag on the initial OMS state (typically «new») causes the reservation process to be executed synchronously during order creation. Despite being documented and covered in training, this pattern is still encountered in projects. It leads to performance degradation, as the customer has to wait for the inventory reservation to complete before the next checkout step loads.
Results:
– Inventory reservation runs as a blocking operation during order placement.
– Significant delay in the customer-facing flow after submitting an order.
Solution:
– Remove the reserved flag from the initial OMS state and introduce a dedicated intermediate state with the «reserved» flag, which offloads reservation to an asynchronous process.
Excessive S3 Calls per Order Item Preview
Excessive and repeated interactions with Amazon S3 from PHP when rendering order history pages can significantly slow down page load times. A typical case involves loading document previews (e.g., images, model renders, invoices) for each order item. For security reasons, a private S3 bucket is often used along with presigned URLs. As a result, each preview requires two S3 interactions: one to generate a signed URL, and one to fetch the file. This leads to high latency and potential preview loading failures due to PHP session locks, which prevent the concurrent generation of multiple previews. The issue is difficult to detect, but it often affects high-value customers with large, complex orders.
Results:
– Slow rendering of order history pages due to multiple S3 round-trips.
– High resource usage and latency from synchronous presigned URL generation.
– Preview load failures in the browser due to locked sessions during concurrent S3 access.
– Negative impact on key customer segments with the most expensive and data-heavy orders.
Solution:
– Move static preview of non-vital assets to a public S3 bucket, store file paths in the database, and let the browser load previews directly from S3.
– Load vital previews from private S3 only after the page is rendered using asynchronous requests.
Missing Null Checks in Certain Versions of Spryker Modules
Some versions of Spryker core and shop modules contain unsafe data type assumptions in their code, such as treating nullable return values as guaranteed objects or strings. While some of these issues may have been fixed in later releases, they are still frequently encountered in real-world projects that have not yet upgraded to the latest versions.
A common case is found in spryker-shop/customer-page, where the controller retrieves the current customer and checks for the presence of an ID to decide on logout. However, the method used to fetch the customer may return null, which leads to unhandled errors and application crashes.
A similar case is found in spryker-shop/agent-page (one of the commits of 1.16.0 release), where the findAgentUsername function may return null, while the findAgentUserByUsername function expects a string, which results in a crash during project updates.
We maintain a checklist of known unsafe data type assumptions for various versions of Spryker modules.
Results:
– Runtime errors due to null dereferencing.
Solution:
– Check which module versions are used in the project and check them against known null-handling issues and other unsafe data type assumptions in those versions.
– Refactor affected modules.
– When upgrading core or shop modules, proactively review release changes and new code for similar unsafe data type assumptions.
– Enforce stricter type checks using PHPStan to detect nullable return values and enforce explicit validation.
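The pattern behind these crashes, and the explicit guard that avoids them, can be illustrated with a small sketch. The functions below are hypothetical Python analogues written for this example and are not the Spryker methods themselves.

```python
from typing import Optional


def find_username(session: dict) -> Optional[str]:
    """May return None, for example when nobody is logged in."""
    return session.get("username")


def find_user_by_username(username: str, users: dict) -> Optional[dict]:
    """Expects a real string; passing None here is the bug to avoid."""
    return users.get(username)


def resolve_current_user(session: dict, users: dict) -> Optional[dict]:
    username = find_username(session)
    if username is None:
        # Explicit guard: handle the nullable result before passing it on.
        return None
    return find_user_by_username(username, users)
```

A static type checker flags the unguarded call because an optional string is passed where a string is required, which is the kind of finding stricter PHPStan rules are meant to surface in PHP code.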
Breaking Changes in the CompanyBusinessUnit Module Affecting Address Updates
A regression introduced in spryker/company-business-unit version 2.16.0 caused a critical issue with customer address persistence. The update replaced usage of the create() method, which previously handled both creation and update, with a new update() method. However, the initial implementation of update() contained a logic error that resulted in the loss of all address associations for a customer.
As a result, when a user updated their profile, all linked addresses were removed from the /customer/address page and disappeared from the address selection step during checkout.
Although this issue was later resolved in version 2.17.2 (released on Sep 9), it remained unpatched for nearly six months and several intermediate versions. It may still be present in many live projects that run versions prior to 2.17.2 because upgrades are delayed or selectively applied.
Results:
– Updating the customer profile triggers the removal of all associated addresses.
– The address step in checkout becomes empty, preventing order placement.
Solution:
– Check the version of spryker/company-business-unit used in the project and verify whether it is affected (2.16.0 or later, but earlier than 2.17.2).
– If affected, either upgrade to 2.17.2+ or manually apply the patch for address updates.
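The affected range is half-open: 2.16.0 is included and 2.17.2 is excluded. A minimal check, assuming plain numeric version strings (pre-release suffixes are not handled), might look like this; the function names are illustrative.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '2.17.2' or 'v2.17.2' into (2, 17, 2) for numeric comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def is_affected(installed: str, introduced: str = "2.16.0", fixed: str = "2.17.2") -> bool:
    """True when introduced <= installed < fixed."""
    return parse_version(introduced) <= parse_version(installed) < parse_version(fixed)
```

Comparing integer tuples rather than raw strings matters: as strings, "2.17.10" would sort before "2.17.2" and be misreported as affected.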
Twig Version Regression Causing Severe Rendering Slowdowns in Certain Spryker Setups
A performance regression introduced in Twig version 3.9.0 (released on April 16, 2024) caused a significant slowdown in template rendering. Although the issue originates from Twig itself and not from Spryker, it critically affects Spryker-based projects due to the complexity and depth of Twig usage in the Atomic Frontend structure.
The slowdown may go unnoticed at first and is difficult to trace, especially in projects relying heavily on component-driven rendering in Yves. Spryker resolved compatibility with the affected Twig release in spryker/twig module version 3.23.0 (released July 10, 2024), which prevents the installation of the problematic Twig version. However, any project using older versions of spryker/twig (prior to 3.23.0) remains vulnerable if Twig was upgraded independently.
Results:
– Severe degradation of page rendering performance (especially on complex pages like category or checkout).
– Difficult troubleshooting due to the external origin of the issue and unclear correlation with frontend slowness.
Solution:
– Check the project’s installed version of Twig and ensure it is not 3.9.0.
– If Twig 3.9.0 is in use, downgrade to the previous stable version (3.8.x).
– Confirm that spryker/twig is updated to at least version 3.23.0, which blocks incompatible Twig versions via dependency constraints.
– If the project is running an older version of spryker/twig, carefully review Composer dependency updates to avoid introducing the problematic Twig version.
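The installed Twig version can be read from composer.lock, which lists locked packages with their name and version under "packages" and "packages-dev". The sketch below is a small standalone check with illustrative function names; it could run as a CI step.

```python
import json
from typing import Optional


def installed_version(composer_lock_text: str, package: str) -> Optional[str]:
    """Return the locked version of `package` without a leading 'v', or None."""
    lock = json.loads(composer_lock_text)
    for entry in lock.get("packages", []) + lock.get("packages-dev", []):
        if entry.get("name") == package:
            return entry.get("version", "").lstrip("v")
    return None


def twig_is_problematic(composer_lock_text: str) -> bool:
    """True when the lock file pins twig/twig to exactly 3.9.0."""
    return installed_version(composer_lock_text, "twig/twig") == "3.9.0"
```

In a project checkout the input would come from reading the composer.lock file in the repository root.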
Undetectable Runtime Errors in Pyz Code Due to Invalid Factory and Dependency Configuration
In Spryker-based projects, the majority of application-specific logic and customizations reside in the Pyz namespace, including custom Factory and DependencyProvider configurations. These areas are flexible by design, but that flexibility introduces a risk: it is possible to write code that passes all static analysis tools (PHPStan, etc.) yet fails at runtime due to incorrect service wiring or broken dependencies.
This is a widespread issue because projects often contain legacy, unreachable, or untested code paths left behind after upgrades or refactoring. Such errors typically remain unnoticed in staging and production environments unless the affected code is triggered by a revert, a deep dependency call, or user action. The problem is not limited to new development: even reverted code or accidental overrides can silently introduce runtime failures.
Results:
– Runtime errors in service resolution via factories or dependency providers that go undetected during CI.
– Reverted or outdated classes reintroduce previously fixed issues.
– Broken but unused code exists in the application, lowering maintainability and increasing technical debt.
Solution:
– Implement project-specific validation tests that systematically instantiate all custom Factory classes under Pyz\* and call their create*() and get*() methods using reflection or factory resolver.
– Ensure overridden Spryker methods in Pyz are also tested, as Pyz may break their dependencies even if the original Spryker implementation is unchanged.
– Treat custom DI wiring in DependencyProvider and Factory as critical code requiring coverage above the typical static level.
– Include these factory-level runtime checks in CI as part of pre-deployment safety validation.
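The reflection-based validation can be sketched as a generic smoke test that calls every argument-free create*/get* method and collects failures. The example is in Python with a hypothetical factory class; a PHP version would use ReflectionClass in the same way, and methods that need arguments still require dedicated tests.

```python
import inspect


def smoke_test_factory(factory) -> dict:
    """Call each zero-argument create*/get* method; return {method_name: error}."""
    failures = {}
    for name, method in inspect.getmembers(factory, predicate=inspect.ismethod):
        if not name.startswith(("create", "get")):
            continue
        params = inspect.signature(method).parameters.values()
        if any(
            p.default is inspect.Parameter.empty
            and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
            for p in params
        ):
            continue  # needs arguments; cover it in a dedicated test
        try:
            method()
        except Exception as exc:  # wiring errors surface here instead of in production
            failures[name] = repr(exc)
    return failures


class ExampleFactory:  # hypothetical stand-in for a project-level factory
    def __init__(self, container: dict):
        self._container = container

    def get_mailer(self):
        return self._container["mailer"]  # KeyError when the dependency is not provided

    def create_sender(self):
        return ("sender", self.get_mailer())

    def create_report(self, order_id):  # skipped by the smoke test: requires an argument
        return ("report", order_id)
```

Running the check over every factory in CI and failing the build on a non-empty result turns broken wiring into a pre-deployment error.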
Incomplete PSP API Integration Leading to Payments Without Orders
Partial or incorrect implementation of the Payment Service Provider API integration can result in cases where payments are authorized or even captured, but no corresponding order is created in the system. This typically happens due to a lack of transactional consistency between payment authorization and order placement logic.
Results:
– Payments processed without matching orders in the database.
– Financial discrepancies that require manual reconciliation.
Solution:
– Prevent CapturePayment from being triggered without an explicit request from the shop backend.
– For payment methods that do not support authorization/capture separation (instant capture), trigger the payment only after the order has been successfully created to prevent irreversible transactions without associated orders.
– Log the entire communication flow with the PSP, including successful and failed steps. Ensure that, in case of order placement failure, payment attempt metadata and status are persistently stored in the database.
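For instant-capture methods, the ordering and logging rules above can be sketched as follows. All three callables are hypothetical placeholders for project code: `create_order` persists the order and returns its reference, `capture` calls the PSP, and `record_attempt` writes a row to persistent storage.

```python
from typing import Callable, Optional


def place_order_then_capture(
    create_order: Callable[[], str],
    capture: Callable[[str], bool],
    record_attempt: Callable[[dict], None],
) -> Optional[str]:
    """Create the order first; trigger the payment only if that succeeded."""
    try:
        order_reference = create_order()
    except Exception as exc:
        # No order, so no money is moved; keep a persistent trace of the attempt.
        record_attempt({"step": "create_order", "status": "failed", "error": repr(exc)})
        return None
    captured = capture(order_reference)
    record_attempt({
        "step": "capture",
        "status": "ok" if captured else "declined",
        "order_reference": order_reference,
    })
    return order_reference
```

The sketch does not handle an exception raised by `capture` itself; in that case the order exists but its payment state is unknown and has to be resolved through a status query to the PSP.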
Incomplete PSP Integration with Inconsistent OMS and PSP State Machines
Misalignment between the Spryker OMS state machine and the state logic expected by the Payment Service Provider can lead to broken payment flows. In particular, simultaneous use of auto-capture settings and a separate CapturePayment OMS state can result in orders being blocked in an undefined or incomplete status — authorized but not captured, due to missing or misfired transitions.
Results:
– Orders are stuck in intermediate states without successful payment capture.
– Operational overhead from manually recovering blocked transactions.
Solution:
– Align OMS logic with the behavior and configuration of the integrated PSP.
WAF Blocking Internal Traffic Between Yves and Zed
In certain environments, WAF policies are applied not only to external client traffic (browser→Yves), but also to internal requests between Yves and Zed. This creates false positives and leads to request failures when enriched or user-originated data, already sanitized upstream, is blocked a second time.
A typical example is the transfer of CustomerTransfer objects from Yves to Zed. For example, a password hash in the payload may contain a “../” substring, triggering a WAF rule intended to block path traversal attempts.
Another case occurs in Agent Mode, where XSS protection rules blocked selection of a customer with a company name that included special characters, for example, “CAF’&THÉ”. This again caused a WAF false positive on the Yves→Zed request, and the blocking occurred silently: no error was logged in New Relic or backend logs, complicating detection and diagnosis.
In both cases, the problem was not with malicious input, but with legitimate data originating from the database or user input that had already passed WAF filtering at the browser→Yves boundary. Applying WAF rules again on Yves→Zed traffic duplicates filtering effort and introduces fragility without meaningful security gain.
Results:
– Internal requests blocked due to WAF false positives on harmless substrings (e.g., “../”, “&”, “"”).
– Legitimate user flows break (e.g., agent cannot select a customer, checkout fails silently).
– Lack of observability: blocked requests never reach the backend and are not reported to monitoring tools.
Solution:
– Review infrastructure to determine whether WAF filtering on internal Yves→Zed traffic is necessary.
– If WAF filtering is required, apply a separate and more relaxed WAF rule set to internal traffic.
– If the options above are blocked, implement validation of data before passing transfer objects between Yves and Zed.
– Ensure observability for blocked requests. WAF responses on internal routes should be logged and monitored.
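The pre-validation fallback can be sketched as a recursive scan of the outgoing payload for substrings known to trip the WAF. The substring list is a hypothetical placeholder and should mirror the rules actually deployed; the payload is assumed to be a JSON-like structure of dictionaries, lists, and scalars.

```python
# Hypothetical list; mirror the rules that the deployed WAF actually enforces.
SUSPECT_SUBSTRINGS = ("../", "&", '"', "<", ">")


def find_waf_risks(payload, path: str = "") -> list:
    """Return (field_path, substring) pairs for string values likely to be blocked."""
    hits = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            hits += find_waf_risks(value, f"{path}.{key}" if path else str(key))
    elif isinstance(payload, list):
        for index, value in enumerate(payload):
            hits += find_waf_risks(value, f"{path}[{index}]")
    elif isinstance(payload, str):
        hits += [(path, s) for s in SUSPECT_SUBSTRINGS if s in payload]
    return hits
```

Logging the returned field paths before the request is sent gives a trace even when the WAF drops the request silently, and it shows which fields should be removed from the transfer or encoded.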
Autoscaling Group Misconfiguration Causing RabbitMQ Instability
Misconfigurations in autoscaling groups can lead to unintended termination and restarts of RabbitMQ instances in the managed Spryker infrastructure. While AWS may automatically relaunch the instance, systems that rely on persistent connections, such as Jenkins, can hang when RabbitMQ becomes temporarily unavailable.
This issue often goes unnoticed initially. The storefront continues to operate normally, but background processes silently stall: order statuses no longer change, queues are not processed, and asynchronous workflows stop functioning. Jenkins tasks that depend on RabbitMQ may hang indefinitely, effectively blocking the pipeline.
Results:
– RabbitMQ becomes intermittently unavailable due to autoscaling group misconfigurations.
– Jenkins jobs hang, blocking background processing.
– OMS and event-driven features silently fail, leading to stuck orders and missing transitions.
– No proactive alerts or monitoring signals indicate the failure, delaying detection.
Solution:
– Resolution typically requires extended investigation and pointing Spryker support to the specific symptoms that indicate an infrastructure-level configuration error. Spryker support may initially report that the system is functioning as intended, which can significantly delay root cause identification and resolution.