Table of Contents
- Executive summary
- Context and business tasks in Russia
- Architectures, data, and algorithms
- Implementation case studies
- Tools and integrations
- Roadmap, budgets, and KPIs
- Risks, law, and ethics
Executive summary
AI in web analytics today is not about “installing a neural network that automatically finds sales growth.” It is about the discipline of building behavioral data (clickstream / event-level data), ensuring its quality and controllability, and then layering models on top: predictive (purchase/churn probability, LTV), diagnostic (reasons for metric decline), causal (uplift/incrementality), and generative (auto-summaries, explanations, chat analyst). In practice, mature companies first build the “data foundation” and only then get ROI from the AI layer; otherwise, the models are trained on noise and produce polished but incorrect “analytics.” The cases of large players show this clearly: the key value there is the speed, quality, and consistency of data rather than the “magic of the algorithm.”
For entrepreneurs in Russia, three features matter:
- The legal framework for personal data. Under Federal Law No. 152-FZ, “personal data” is any information relating to an identified or identifiable natural person, directly or indirectly. This is a broad definition, so many identifiers (cookie/device ID/online identifiers that can be linked to a person) should, in practice, be designed as potentially personal data.
- Localization when collected via the Internet. Part 5 of Article 18 of 152-FZ (in the version effective from 2025) establishes that when collecting personal data (including via the Internet), the recording/systematization/accumulation/storage/clarification/retrieval of personal data of Russian citizens using databases outside the Russian Federation are not permitted, except for certain exceptions under Article 6. This directly affects the architecture of web analytics and the choice of SaaS tools.
- Growing liability and “proceduralization”. Since May 30, 2025, substantially stricter offenses and fines for violations in the area of personal data have applied (including for failure to submit notifications and for data leaks) — the changes were introduced by Federal Law No. 420-FZ into the Code of Administrative Offenses of the Russian Federation; the published text is available on the official legal information portal.
What you have not specified, but what is critical for accurate design and budgeting: industry (e-commerce/services/SaaS/content), current stack, approximate number of events/month (or traffic), presence of a mobile app, share of advertising (performance), need for real-time, required depth of integration with CRM/offline sales, preferred hosting environment (on-prem/Russian cloud), acceptable payback horizon. Below I provide solution options for typical business profiles and indicate where clarification is needed.
Context and business tasks in Russia
What exactly “AI for web analytics” means
To make the article useful for an entrepreneur (and not only for a data engineer), let’s divide AI web analytics into three layers:
Layer A: Data collection and quality (without it, AI does not work).
This includes event tracking, identification, consistent metric definitions, schema control (data contracts), anti-bot/anti-fraud, enrichment (UTM, sources, reference data), and delivering data into storage/marts. It is this layer in large companies that scales to tens of millions of events per minute and becomes a standalone platform. In Avito, Clickstream is described as a system for collecting and processing analytics events with reliable delivery of “20 million cs events per minute,” a unified event format, and client-side data validation — an example of how “analytics” starts with data engineering.
Layer B: ML models (predictive/diagnostic/causal analytics).
Examples: purchase and churn probability, predicted revenue, segmentation by action probability, LTV, propensity scoring, attribution, anomaly detection, and root-cause analysis. For example, the Google Analytics 4 documentation describes predictive metrics (purchase/churn probability/predicted revenue), and predictions are refreshed regularly and can be disabled if model quality deteriorates.
In Adobe Analytics, anomaly detection is described as a statistical method for detecting metric changes relative to historical data, separating “signal from noise,” and supporting KPI forecasts — a typical “ML layer” within an analytics platform.
Layer C: GenAI/LLM on top of analytics (interface and decision acceleration).
Typical functions: generating explanations of “why conversion dropped,” auto-summaries, segment suggestions, a “chat analyst” (text-to-SQL / text-to-metric), and investigation automation. In some product analytics platforms, this is highlighted as an “AI assistant” and separate modes of behavior analysis.
Which business tasks most often deliver ROI
For entrepreneurs, the key criterion is not the “coolness of the model,” but incremental profit/savings (what changed thanks to the system). In practice, 6 areas most often pay off:
- Faster detection of problems and opportunities (anomaly analysis + diagnostics).
  - The business learns about a drop in revenue/conversion not “after a week” but “after 30 minutes,” and immediately sees the likely causes (channel/page/segment/region/version). In industrial tools, this is done with anomaly detection and causal analysis.
- Predictive segments for marketing and CRM (LTV/propensity/churn).
  - For example, “purchase probability in the next 7 days” or “churn probability” can be used to build audiences and communications.
  - In Russian practice, Mindbox, in a piece about LTV forecasting, cites the Mario Berlucci case: with ~200,000 website visitors per month, a 5-person data science team implemented a user-action prediction mechanism in six months that brings the company more than 30% of revenue (as stated in the source).
- Personalization of the experience on the website/app (recommendations + real-time triggers).
  - This is not only “recommendation blocks,” but also personalized discounts, selections, and onboarding scenarios. In international cases, personalization almost always relies on low-latency clickstream data and unified customer profiles. For example, Burberry describes a Snowplow + Databricks setup for an “AI-Ready Customer 360,” more than 40 personalized models (recommendations, propensity, LTV), and a sharp reduction in clickstream data latency.
- End-to-end/marketing analytics and attribution (including data-driven).
  - The classic problem: “last click” distorts the picture, and money is redistributed incorrectly. To move toward a fairer assessment of channels, Markov, Shapley, and other data-driven methods described in the scientific literature on multi-channel attribution are used.
- Savings on manual analytics (self-service + LLM assistant).
  - In mature organizations, analytics becomes a “mass competency”: product, marketing, and sales answer questions themselves without waiting in a queue for analysts. This is visible, for example, in product analytics cases where the focus is on self-service.
- Resilience to sanctions/vendor risks (data control).
  - Russian companies explicitly describe the motivation to “not depend on third-party vendors” and to have control over data, including the risk of blocks/leaks. The T-Bank example shows that even internal data platforms require mature backup and governance processes: an error that deleted clickstream data became a separate incident and a lesson in reliability.
Architectures, data, and algorithms
Reference architecture for “AI web analytics” for Russian business
Below is a universal scheme that works for both e-commerce and services/subscriptions. In Russian conditions, it is critical to provide for a “personal data perimeter” (localization, access, logging) and the ability to collect data server-side.
[Diagram: reference architecture. Components: Web / App / Server sources; Web SDK, Mobile SDK, and server-side events; Consent/Cookie banner; Collector; Kafka / Queue; Flink/Spark Streaming (enrichment + validation); ClickHouse / DWH; Object Storage / Data Lake; dbt/ETL (marts + metrics); BI/dashboards; Feature Store; Training Pipeline; Model Serving; Activation (CDP/CRM/Ads via Reverse ETL); LLM assistant (explanations/summaries); Data+Model Monitoring.]
Why mature systems are structured exactly this way.
Avito describes an industrial clickstream event-driven architecture with a unified format and at-least-once delivery guarantees, as well as streaming enrichment and anti-bot cleanup jobs on Flink and retention in Kafka as a mechanism to “survive an outage without losses” within the window.
At Magnit (together with Manzana Group), the use of managed ClickHouse for loyalty program analytics and unloading to object storage is described, which is typical for building “marts + history.”
Two practical architecture options for getting started in Russia
Option for small/medium businesses: “cloud DWH + minimal engineering.”
The basis: tracker → export of raw logs → ClickHouse → BI → simple ML tasks (funnel analysis, cohorts, anomaly detection) in notebooks/jobs. Practical steps in Yandex Cloud documentation: collecting data from Yandex Metrica via Logs API, loading it into ClickHouse, and calculating funnels/cohorts/retention in DataSphere with subsequent visualization in DataLens.
Option for high-load businesses: “in-house event platform + composable CDP.”
Suitable if you have many sources (web+app+offline), need real-time, and have strong requirements for data control and flexibility. An international example is the Transavia case, which describes a transition from a “fragmented DIY pipeline” to a composable CDP (Snowplow + Databricks + reverse ETL). The value is unified collection across web+app, centralized business logic, faster delivery of use cases, and measurable incremental impact.
Data templates for AI web analytics
If you are building a system “for AI,” you need not only “page_view,” but also a proper event taxonomy and a data contract. Below is a template that can be applied in almost any industry.
Event template (event fact) — recommended required fields
- event_time (UTC)
- event_name (for example, page_view, product_view, add_to_cart, purchase)
- event_id (UUID)
- session_id
- anonymous_id (cookie/device ID; preferably your own, server-side)
- user_id (if authenticated; better stored as a hash/internal surrogate key)
- page_url, referrer, utm_*, traffic_source
- device, os, browser, app_version
- geo (country/region/city, according to the minimization policy)
- properties (JSON/Map for extensions)
Example of a (simplified) event in JSON
```json
{
  "event_id": "b3e1d9b2-23a6-4b2a-8a8c-2dfe1c6b2a01",
  "event_time": "2026-03-03T10:15:22Z",
  "event_name": "add_to_cart",
  "session_id": "s-8c1b8f",
  "anonymous_id": "a-9f12c3",
  "user_id": "u-4a71f2",
  "page_url": "https://example.ru/product/123",
  "referrer": "https://example.ru/search?q=...",
  "utm_source": "yandex",
  "utm_medium": "cpc",
  "properties": {
    "product_id": "123",
    "price": 1990,
    "currency": "RUB",
    "quantity": 1
  }
}
```
Why contracts and a shared semantic model are more important than “a hundred events.”
In the Avito example, emphasis is placed on an “event-driven architecture with unified field semantics” and “centralized management of event definitions,” as well as client-side validation. This is exactly what prevents an “event zoo” and helps ML avoid training on garbage.
And in the review of T-Bank’s data platform, the Governance layer and “Data Contracts” are separately highlighted as the central point of expectations and accountability for data (structure, SLA, etc.) — this is a best practice for scalable analytics.
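To make the idea of a data contract concrete, here is a minimal sketch of a validator for the event template above. The required-field set, the allow-list of event names, and the function name `validate_event` are assumptions made for illustration; they are not taken from Avito, T-Bank, or any specific tool.

```python
from datetime import datetime
from uuid import UUID

# Hypothetical contract: required fields and their expected Python types.
REQUIRED_FIELDS = {
    "event_id": str,
    "event_time": str,
    "event_name": str,
    "session_id": str,
    "anonymous_id": str,
}
# Hypothetical allow-list of event names for this illustration.
ALLOWED_EVENTS = {"page_view", "product_view", "add_to_cart", "purchase"}


def validate_event(event: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the event is valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for {field}")
    if errors:
        return errors  # deeper checks need the required fields to be present
    try:
        UUID(event["event_id"])
    except ValueError:
        errors.append("event_id is not a UUID")
    try:
        # Treat the trailing "Z" used in the JSON example as UTC.
        datetime.fromisoformat(event["event_time"].replace("Z", "+00:00"))
    except ValueError:
        errors.append("event_time is not ISO 8601")
    if event["event_name"] not in ALLOWED_EVENTS:
        errors.append(f"unknown event_name: {event['event_name']}")
    if not isinstance(event.get("properties", {}), dict):
        errors.append("properties must be an object")
    return errors
```

In a real pipeline such checks would typically run both in the client SDK and in the stream-processing layer, and the share of events that pass them becomes the "event completeness" KPI.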
Approaches to building an event schema
Approach 1: self-describing / schema registry (strictly for “engineering” analytics and ML).
Snowplow provides a detailed description of the self-describing JSON approach and the need for schemas/registry (Iglu) to validate events and arbitrary fields. This is convenient when you want to guarantee the quality of the input stream and schema evolution.
Approach 2: autocapture + selective manual instrumentation (fast for product).
The PostHog documentation says that by default the platform automatically captures pageviews, clicks, input changes, and form submissions, and this can be filtered/configured. This speeds up the start, but requires discipline: autocapture quickly creates “noisy” events that degrade models and reports if normalization is not done.
Approach 3: exporting raw logs from counters (often a compromise in Russia).
In the Yandex ecosystem, there is Metrica Logs API and “connecting Logs API to ClickHouse” (official documentation), which makes it possible to move from aggregates to raw data and build your own metrics/ML system on top of it.
For large businesses, the product “Metrica Pro” is highlighted, where real-time streaming of unaggregated Metrica data into managed ClickHouse in Yandex Cloud without volume limits is described, as well as extended quotas/access to data without sampling; the source lists the price as “from 300,000 ₽/month,” separate from the cluster infrastructure.
Algorithms that are actually used in AI web analytics
Below is an “algorithm map” tied to practical applicability and metrics.
Anomaly detection (metrics, conversions, revenue, traffic)
- Basic level: statistical corridors, seasonality, control charts.
- Advanced level: Seasonal-Hybrid ESD and its variations, robust approaches for seasonal series (described in research on time-series anomaly detection).
- Product implementation: anomaly detection as a function of the analytics platform (example: Adobe Analytics).
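The basic level (a statistical corridor with seasonality) can be sketched as follows. This is a simplified illustration, not Seasonal-Hybrid ESD and not the method used by any particular platform: each point is compared with a median ± k·MAD corridor built from earlier points at the same position in the seasonal cycle. The function names and the defaults (`k=3.0`, `min_history=4`) are assumptions.

```python
from statistics import median


def robust_corridor(history: list[float], k: float = 3.0) -> tuple[float, float]:
    """Median +/- k * scaled MAD; 1.4826 makes MAD comparable to a standard
    deviation for normally distributed data."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    spread = 1.4826 * mad
    return med - k * spread, med + k * spread


def detect_anomalies(series: list[float], period: int, k: float = 3.0,
                     min_history: int = 4) -> list[int]:
    """Flag indices whose value leaves the corridor built from earlier points
    at the same phase of the seasonal cycle (e.g. the same day of the week)."""
    anomalies = []
    for i, value in enumerate(series):
        same_phase = series[i % period:i:period]  # earlier points, same phase
        if len(same_phase) < min_history:
            continue  # not enough history to judge this point
        low, high = robust_corridor(same_phase, k)
        if value < low or value > high:
            anomalies.append(i)
    return anomalies
```

For daily revenue, `period=7` compares Mondays with Mondays. Note that a perfectly constant history gives a zero-width corridor, so in practice a minimum spread is usually added.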
Predictive lifecycle models (propensity/churn/LTV)
- An out-of-the-box example: predictive metrics in GA4 (purchase/churn probability/predicted revenue) and predictive audiences based on them.
- For in-house implementation: gradient boosting/neural networks/survival analysis, but it is critical to define: what counts as “churn” for your business and what observation/prediction windows are. (This part is always “unspecified” until the business model is clarified.)
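The point about defining "churn" and the observation/prediction windows can be illustrated with a small labeling sketch. The definition used here (active in the 30 days before the cutoff, no activity in the following 14 days) and the function name `build_churn_labels` are assumptions chosen for the example; the right windows depend on the business model.

```python
from datetime import date, timedelta


def build_churn_labels(activity, cutoff, observation_days=30, prediction_days=14):
    """activity: dict user_id -> list of activity dates.
    Users active in the observation window before `cutoff` get label 1 (churned)
    if they show no activity in the prediction window after it, else 0.
    Users inactive in the observation window are excluded from the population."""
    obs_start = cutoff - timedelta(days=observation_days)
    pred_end = cutoff + timedelta(days=prediction_days)
    labels = {}
    for user_id, days in activity.items():
        active_before = any(obs_start <= d < cutoff for d in days)
        if not active_before:
            continue
        active_after = any(cutoff <= d < pred_end for d in days)
        labels[user_id] = 0 if active_after else 1
    return labels
```

Features for the model must be computed only from data before `cutoff`; otherwise the label leaks into the inputs.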
Causal models and uplift (incrementality of marketing/personalization)
- Uplift modeling is described as a family of ML techniques for estimating the causal effect of a treatment at the individual/segment level and is used for personalization in e-commerce.
- Why this matters: the business does not need the "probability of buying," but the "probability of buying because of the campaign/change." This is the key to reducing "empty discounts" and optimizing the marketing budget.
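The difference between "probability of buying" and "probability of buying because of the campaign" can be shown with the simplest possible estimate: the gap in conversion rates between randomly assigned treatment and control groups, computed per segment. This is a sketch under the assumption of a properly randomized test; it is not a full uplift model, and the function name is illustrative.

```python
from collections import defaultdict


def uplift_by_segment(rows):
    """rows: iterable of (segment, treated: bool, converted: bool) from a randomized test.
    Returns segment -> conversion rate in treatment minus conversion rate in control."""
    # Per segment and group: [conversions, users]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, converted in rows:
        cell = counts[segment][treated]
        cell[0] += int(converted)
        cell[1] += 1
    result = {}
    for segment, groups in counts.items():
        (t_conv, t_n), (c_conv, c_n) = groups[True], groups[False]
        if t_n == 0 or c_n == 0:
            continue  # uplift cannot be estimated without both groups
        result[segment] = t_conv / t_n - c_conv / c_n
    return result
```

A segment with a high conversion rate but near-zero uplift is exactly where "empty discounts" go: those users would have bought anyway.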
Data-driven attribution (Markov/Shapley, etc.)
- For multichannel attribution, Markov models are widely discussed; they redistribute credit among channels differently than last-touch, and this is described in works on customer journey analysis.
- Shapley methods for attribution modeling in online advertising are analyzed in a scientific article on arXiv (and offer more efficient computations than "naive" versions).
- A practical recommendation for businesses: start with something "simple but stable" (for example, last-click + rules), and move to Markov/Shapley only when you are confident in the quality of the journeys and identifiers, and have ruled out bots.
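For orientation, a first-order Markov attribution with the "removal effect" can be sketched in a few dozen lines. This is a toy illustration of the general technique, assuming clean, bot-free paths and at least one conversion; the state names and function names are made up for the example, and the fixed-point iteration stands in for solving the linear system exactly.

```python
from collections import defaultdict

START, CONV, NULL = "(start)", "(conversion)", "(null)"


def transition_probs(paths):
    """paths: list of (channels, converted). Returns state -> {next_state: probability}."""
    counts = defaultdict(lambda: defaultdict(int))
    for channels, converted in paths:
        states = [START, *channels, CONV if converted else NULL]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}


def conversion_prob(probs, removed=None, iterations=200):
    """Probability of reaching CONV from START; a removed channel behaves like NULL."""
    value = defaultdict(float)
    value[CONV] = 1.0
    for _ in range(iterations):  # simple fixed-point iteration
        for state, nxt in probs.items():
            if state == removed:
                continue
            value[state] = sum(p * (0.0 if b == removed else value[b])
                               for b, p in nxt.items())
    return value[START]


def markov_attribution(paths):
    """Split the observed conversions among channels in proportion to removal effects."""
    probs = transition_probs(paths)
    base = conversion_prob(probs)
    channels = [s for s in probs if s != START]
    effects = {c: 1 - conversion_prob(probs, removed=c) / base for c in channels}
    total = sum(effects.values())
    conversions = sum(1 for _, converted in paths if converted)
    return {c: conversions * e / total for c, e in effects.items()}
```

The result depends heavily on how journeys are stitched together, which is why identifier quality and bot filtering come first.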
LLM assistant in analytics (communication, diagnostics, acceleration)
- In product platforms, approaches such as "automated insights" and AI agents for data analysis and presenting insights to the user are described.
- Technically, this is most often "RAG over metrics + tools for querying the DWH," rather than "a model that knows your data." For Russia, the key question is where the LLM is hosted and what data it can access (see the section on law and ethics).
Implementation cases
Below is a selection of cases (international and Russian) where you can see: architecture, organizational takeaways, and measurable effects.
International cases
Burberry: real-time clickstream + Customer 360 + dozens of personalization models
The Snowplow case describes how Burberry reduced clickstream data latency by 99% and uses an "AI-Ready Customer 360," and also mentions a set of personalized models (recommendations, propensity scoring, lifetime value). It also describes a move to server-side cookies, which the source reports as extending "cookie duration" to 12 months and making attribution more accurate.
A practical takeaway for entrepreneurs: ROI often comes not from a "complex neural network," but from data arriving on time and becoming usable for activation (sales associates in stores can see behavior "here and now").
Transavia: moving from a fragmented DIY pipeline to a composable CDP and measurable incremental impact
The case describes in detail the problems of a "tool zoo": duplicated logic, poor data quality, lack of monitoring, inability to include a mobile app, rising license costs, and the need for a separate engineering team. It then describes the move to a composable CDP (Snowplow as a unified collection layer, Databricks for processing/storage, reverse ETL for activation). The stated results include €27 million in incremental revenue, a 40% reduction in license costs, faster use-case implementation (in the text — "from 6 months to 1 month" for a specific case), conversion uplift in certain channels, and improved NPS at personalized touchpoints.
Practical takeaway: if you have "many services and many contractors," the "best-of-breed + unified event collection + centralized logic" model is often cheaper and easier to manage than a set of disconnected trackers.
Square: self-service product analytics as a culture
The Amplitude case describes Square as using the platform as a "central source for user insights," and cites a metric of "100+ employees using Amplitude daily," as well as "billions of events per month." This illustrates the value of self-service: not "waiting for analysts," but making decisions based on event data at the level of product and marketing teams.
Jumbo Interactive: predictive cohorts for user activation
In the Amplitude material on "Predictive Cohorts," it is described that the feature builds cohorts based on future behavior, and the Jumbo Interactive example is given: identifying users likely to activate and sending email offers to nudge them down the right paths.
Rakuten (via the Rakuten Viber product): segmentation and KPI work
The Mixpanel blog gives an example of segmentation analysis at Rakuten Viber to understand the drivers of engagement and retention through segments and KPI.
Takeaway: even without "heavy ML," proper segmentation can be a stepping stone to predictive models.
Russian cases
Avito: its own clickstream platform and server-to-server activation
One Avito article explains why the company needs a tool that "does not depend on third-party vendors," is easy to integrate, and provides self-service visualization; it also separately mentions the alternatives (open-source and commercial) and the risk of dependency/leaks/blocking. The key point is the description of Clickstream as a platform with 20 million events/minute, centralized event management, client-side validation, and a unified format for end-to-end analytics.
The second article describes the Marketing Manager service: sending targeted events to external ad systems without a "zoo of SDKs" in the frontend, with the ability to enrich data with analytical models, and an example architecture: events → clickstream → Flink enrichment → topics by "accounts" → sending. It also explicitly mentions the risks of sanctions, performance degradation due to 3rd-party SDKs, and the advantages of the S2S approach.
Takeaway for entrepreneurs: server-side collection and activation is not "complicated and expensive," but a way to (a) reduce risks, (b) improve site performance, and (c) ensure data quality and manageability for ML.
Magnit + Manzana Group: loyalty program migration, analytics on ClickHouse, personalized offers
The Yandex Cloud case provides rare "scale numbers": migration of loyalty for 80 million customers, migration of 2–3 million users per week, support for 300 million personalized offers, peak loads of 7.5 thousand transactions/sec, use of 230 virtual machines, storage/processing of 1 PB of data on the cloud side (as context), selection of managed ClickHouse for reporting and analytics, unloading to object storage (50+ TB). The motivation to "ensure compliance with Federal Law 152" is also stated separately.
Takeaway: for large volumes, "end-to-end analytics + personalization" always goes hand in hand with infrastructure and compliance.
Galamart: growing average basket size and retention through a loyalty program, segmentation, and experiments
The Mindbox case presents results: an increase in average basket size among loyalty program participants (+37% versus non-participants), an increase in customer return rate (retention rate +12%), and organizational lessons on offline data collection (cashiers as a "bottleneck"), segmentation, and A/B testing of hypotheses.
Takeaway: even the "non-AI" part (data collection, identification, experiments) is the foundation without which ML predictions cannot be validated.
Mario Berlucci: small traffic by enterprise standards, but a functioning ML loop
In the Mindbox material on LTV prediction, the example given is: ~200 thousand website visitors/month, a five-person team (an analyst, two data scientists, a marketer, and a developer), 6 months to implement the prediction of user actions, and the claim that the mechanism brings in >30% of revenue.
Takeaway: ML web analytics is possible not only for "hypermarketplaces," but it requires focusing on one or two use cases with clear monetization.
Tools and integrations
Comparison table of tools for AI web analytics
The table below is a reference for solution classes. Important: legal applicability (localization/personal data) depends on where the databases are physically located, who the controller/processor is, and how you set up the initial recording of personal data (see the “Law” section).
| Class | Example | Strengths for AI | Limitations/Risks |
| --- | --- | --- | --- |
| Built-in AI in analytics | GA4 (predictive metrics/audiences) | Fast start for predictive segments and forecasts for marketing | Platform dependency; compliance/localization cannot be assessed without analyzing the collection architecture |
| Enterprise analytics with anomaly detection | Adobe Analytics | Automatic anomaly detection, KPI forecasts, “noise vs signal” | Cost/licensing; implementation and governance are more complex |
| Russian counter + DWH add-on | Yandex Metrica + Logs API/ClickHouse | Export of raw data, calculation of proprietary metrics/ML in DWH | You need to build marts/ML yourself; quota limitations in the basic version |
| Metrica enterprise data package | Metrica Pro | Real-time streaming of unaggregated data into managed ClickHouse, without sampling; price “from 300k RUB/month” (without the cluster) | Not suitable for everyone in terms of volume/price; you still need a DWH/team |
| Open-source product analytics | PostHog (self-host) | Autocapture of events, product analytics, there is an AI direction in the product | Requires engineering support and event discipline |
| Open-source web analytics (privacy-oriented) | Matomo (self-host) | Cookie-free/no-personal-data modes are possible (depending on configuration and jurisdiction) | You need to carefully align this with Russian law and your identification model |
| Behavioral data platform (Composable / enterprise) | Snowplow | Strict event schemas (self-describing), delivery to DWH/stream, “AI-ready” behavioral data | Requires a mature data team and schema governance |
| Columnar DWH for events | ClickHouse | OLAP database for high event volumes and fast queries | Requires partitioning design, retention, marts |
Typical technology stack for development (open-source + commercial)
Below is a “constructor” made of components. You should choose not “everything,” but the minimum set for your maturity and compliance requirements.
Event collection (web/app/server)
- SDKs/tags for web and mobile; server-side events for orders/payments/statuses (better as the “truth” than a “button click”). Avito shows an S2S approach and the idea of “not multiplying 3rd party SDKs in the front end.”
- Consent/CMP: a technical component for managing consent and data minimization (legal details below).
Transport and stream processing
- Kafka/queue + Flink/Spark Streaming for enrichment, anti-bot, deduplication, sessionization. Avito explicitly describes an event pipeline through clickstream and flink-jobs, at-least-once delivery, and Kafka retention as a resilience mechanism.
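Two of the stream-processing steps named above, deduplication and sessionization, can be sketched in batch form. With at-least-once delivery the same `event_id` may arrive more than once, so duplicates are dropped before sessions are assigned. The 30-minute inactivity timeout and the `session_no` field are assumptions for the example, not a description of Avito's jobs.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout


def dedupe_and_sessionize(events):
    """events: list of dicts with event_id, anonymous_id, event_time (datetime).
    Drops repeated event_id values and assigns a per-visitor session number:
    a new session starts after more than SESSION_GAP of inactivity."""
    seen = set()
    last_seen = {}  # anonymous_id -> (last event_time, session number)
    result = []
    for event in sorted(events, key=lambda e: e["event_time"]):
        if event["event_id"] in seen:
            continue  # duplicate delivery of the same event
        seen.add(event["event_id"])
        visitor = event["anonymous_id"]
        prev_time, session_no = last_seen.get(visitor, (None, 0))
        if prev_time is None or event["event_time"] - prev_time > SESSION_GAP:
            session_no += 1
        last_seen[visitor] = (event["event_time"], session_no)
        result.append({**event, "session_no": session_no})
    return result
```

In a streaming engine the same logic is usually expressed with keyed state and session windows, and the set of seen IDs is bounded by a time-to-live.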
Storage and marts
- ClickHouse as OLAP for events, Object Storage for history and low-cost storage. In the “Magnit + Manzana” case, ClickHouse is used for analytics, and daily data is unloaded to object storage; 50+ TB has been accumulated.
- dbt/ETL layer for “single metric definitions” and marts.
ML/MLOps
- Feature Store (Feast or a warehouse-native approach), MLflow/Kubeflow, data quality and drift monitoring.
- For uplift/causality, separate pipelines are needed, since experiments and correct setup are required.
BI and self-service
- BI (DataLens/Superset/Metabase) + data catalog + access policies. In the T-Bank review, their in-house BI based on Superset is described, along with separate metrics for the number of streams/dashboards/quality checks as signs of platform maturity.
CDP, CRM, BI integrations: how to connect them so AI generates revenue
CDP ↔ DWH (composable CDP instead of a monolith)
- The idea: events are collected into a unified schema, processed in the DWH, and the necessary segments/scorings are returned to activation systems (reverse ETL). In the Transavia case, reverse ETL appears directly as part of a composable CDP.
CRM ↔ web analytics (linking online and orders)
- Critical: calculate revenue/margin/repeat purchases not “by clicks,” but by orders from CRM/OMS.
- In Metrica Pro, data about customers and orders from CRM (hash identifiers of orders and users, statuses, order contents) are separately mentioned as part of the extended data for analytics.
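A minimal sketch of "revenue by orders, not by clicks": confirmed orders from CRM/OMS are joined to web sessions by a shared user key, and each order is credited to the user's latest session before the order. The field names, the status list, and the last-touch rule are assumptions made for illustration.

```python
from collections import defaultdict


def revenue_by_source(sessions, orders, paid_statuses=("paid", "delivered")):
    """sessions: list of dicts with user_id, utm_source, started_at (sortable).
    orders: list of dicts with user_id, amount, status, created_at.
    Credits each confirmed order to the user's latest session before the order."""
    by_user = defaultdict(list)
    for s in sessions:
        by_user[s["user_id"]].append(s)
    revenue = defaultdict(float)
    for order in orders:
        if order["status"] not in paid_statuses:
            continue  # cancelled or unpaid orders do not count as revenue
        prior = [s for s in by_user[order["user_id"]]
                 if s["started_at"] <= order["created_at"]]
        if prior:
            source = max(prior, key=lambda s: s["started_at"])["utm_source"]
        else:
            source = "(unknown)"  # order without a matching web session
        revenue[source] += order["amount"]
    return dict(revenue)
```

The size of the "(unknown)" bucket is itself a useful diagnostic of how well online identifiers are linked to CRM records.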
BI ↔ LLM (chat analyst)
- The best pattern: the LLM does not receive “raw personal data,” but works through a layer of metrics/marts and query tools, with logging. This reduces leak risk and makes answers reproducible (a requirement for controllability).
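The pattern described above can be sketched as a small "metric tool" that an LLM agent would call instead of querying raw tables. No LLM is invoked here; the mart contents, the allow-lists, and the `query_metric` function are hypothetical and only illustrate the access boundary and the logging.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("metric_tool")

# Hypothetical aggregated mart: daily totals per channel, no user-level rows.
DAILY_MART = [
    {"date": "2026-03-01", "channel": "cpc", "sessions": 1000, "orders": 30},
    {"date": "2026-03-01", "channel": "email", "sessions": 200, "orders": 10},
    {"date": "2026-03-02", "channel": "cpc", "sessions": 1000, "orders": 20},
]

# Only metrics with a single agreed definition are exposed to the assistant.
ALLOWED_METRICS = {
    "sessions": lambda rows: sum(r["sessions"] for r in rows),
    "orders": lambda rows: sum(r["orders"] for r in rows),
    "conversion_rate": lambda rows: sum(r["orders"] for r in rows)
    / sum(r["sessions"] for r in rows),
}
ALLOWED_DIMENSIONS = {"date", "channel"}


def query_metric(metric: str, group_by: str, requested_by: str) -> dict:
    """Return an allow-listed metric grouped by an allow-listed dimension.
    Every call is logged so that an assistant's answer can be reproduced."""
    if metric not in ALLOWED_METRICS or group_by not in ALLOWED_DIMENSIONS:
        raise ValueError("metric or dimension is not in the allow-list")
    logger.info("metric=%s group_by=%s requested_by=%s", metric, group_by, requested_by)
    groups = defaultdict(list)
    for row in DAILY_MART:
        groups[row[group_by]].append(row)
    return {key: ALLOWED_METRICS[metric](rows) for key, rows in groups.items()}
```

Because the tool only sees aggregates, a prompt asking for individual users' data simply has nothing to return.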
Roadmap, budgets, and KPIs
Where businesses in Russia should start
Below is a sequence that minimizes the risk of “invested in ML, but there is no data.” Where information is required but not present in the request, I mark “not specified” and suggest options.
Step zero: define the goal and the first use cases (2–5 weeks)
What should be written on 1 page:
- Which decisions will become better and faster? (for example, “ROMI control,” “personalization,” “retention,” “anti-fraud,” “faster investigations”).
- What will count as “success” in numbers (see KPIs below).
- What data sources and which channels: website/app/CRM/call tracking (not specified).
- Placement requirements: on-prem/Russian cloud (not specified).
- Constraints: legal (152-FZ), sanctions-related, staffing.
Step one: data inventory and legal framework (1–3 weeks)
- Determine which data may be personal data (ID, email/phone, payments, IP — depending on linkability). 152-FZ gives a broad definition of personal data and defines the “controller” and “processing” as a set of actions (collection/recording/storage/transmission, etc.).
- Record where the initial storage of Russian citizens’ data will be located (in Russia) — this affects the choice of SaaS and architecture.
- Check whether notification to Roskomnadzor is required before processing personal data begins (see Article 22 of 152-FZ) and the submission procedure via the portal.
- If there is cross-border transfer, take into account the procedure for notifying Roskomnadzor that has been in effect since March 1, 2023 (official personal data portal).
Step two: event taxonomy and data contracts (2–6 weeks)
- Describe 30–80 “signal” events needed for the first use cases and their properties.
- Establish a rule: each event has an owner, a description of its business meaning, and quality tests (schema validity, completeness).
- Avito’s approach to unified events and common fields is a good illustration: defining basic events and mandatory fields.
Step three: build a minimal DWH and marts (4–10 weeks)
- Goal: a single source of truth (orders from CRM + behavioral events).
- Make sure you can build: funnels, cohort analysis, retention. An example of a practical guide with ClickHouse + DataSphere + DataLens in Yandex Cloud shows the sequence of such steps.
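As a quick check that the marts support the basics, a strict-order funnel can be computed directly from events. This is a simplified sketch (in a real DWH it would normally be a SQL query over the events table); the tuple layout and function name are assumptions.

```python
def funnel_counts(events, steps):
    """events: list of (user_id, event_name, timestamp).
    Counts users who reached each step in order: a step counts only
    after the previous step has already happened for that user."""
    progress = {}  # user_id -> number of funnel steps completed so far
    for user_id, name, _ in sorted(events, key=lambda e: e[2]):
        idx = progress.get(user_id, 0)
        if idx < len(steps) and name == steps[idx]:
            progress[user_id] = idx + 1
    return [sum(1 for reached in progress.values() if reached > i)
            for i in range(len(steps))]
```

Dividing adjacent counts gives step-to-step conversion, which is the baseline that later models and experiments are measured against.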
Step four: add an AI layer with measurable value (6–16 weeks)
Choose 1–2 models:
- anomaly detection for key metrics (revenue/conversion/traffic),
- churn/propensity for one target funnel,
- an uplift pilot on one communication channel (email/push/onboarding) if experiments are available.
Important: causal models require experiment design and proper evaluation; otherwise, “AI” turns into guesswork.
Step five: activation (Reverse ETL, CRM, advertising) and the improvement loop (ongoing)
- Segments and scores must be sent back to CRM/CDP/ad platforms and measured by incrementality. The pattern “events → processing → activation” is shown both by Avito (export to external ad accounts) and by Transavia (reverse ETL).
Implementation roadmap with stages and timelines
[Gantt chart: "AI web analytics: typical roadmap," time axis from 2026-04-01 to 2026-11-01. Tracks: Strategy and compliance; Data and platform; AI/ML; Activation and scaling. Tasks, in order: Goals, KPI, use-case backlog; Personal data perimeter, notifications, policy; Event taxonomy + contracts; Event collection + DWH (ClickHouse); Data marts + BI; Anomaly detection (MVP); Propensity/churn (1 model); Reverse ETL + CRM/CDP; Data and model quality monitoring.]
This diagram is the “middle” path. In a minimal scenario, some stages are compressed (for example, contracts and data marts are simplified), while in a large-scale scenario separate tracks are added: real-time personalization, feature store, uplift experiments, data catalog, and access controls by sensitivity.
Budgets and resources: three scenarios
Traffic, event volume, channels, and real-time requirements are not specified, so the figures below are ranges with the logic behind them; exact figures can only be obtained after estimating event volume and storage/latency requirements.
Minimal scenario
Profile: SMB, one website, no mobile app, goal — “quickly get manageable analytics + 1 AI feature” (anomalies or simple propensity).
- Team (partially outsourced): 0.5–1 analyst, 0.5 data engineer, 0.2–0.5 ML/DS (periodically), 0.2 lawyer/personal data officer.
- Technologies: counter/collection + small ClickHouse + BI + notebooks/scheduler.
- Timeline: ~2–3 months to the first working AI use case.
Budget (very rough, assuming a mixed team): 1.5–4.5 million RUB for launch (work) + 50–300 thousand RUB/month for infrastructure/support (depends on event volume and perimeter).
If you use the enterprise package “Metrica Pro,” the license alone, according to the source, starts “from 300 thousand RUB/month” (cluster separately).
Medium scenario
Profile: growing e-commerce/service, web + possibly app (not specified), needs end-to-end analytics and 2–3 models (anomalies, churn/propensity, basic attribution).
- Team (in-house): product analyst/owner, 1 data engineer, 1 BI analyst, 1 ML engineer/DS, DevOps/SRE 0.5–1, lawyer/compliance 0.2–0.5.
- Timeline: 4–6 months to a stable perimeter (with data quality monitoring).
- Technologies: Kafka/streaming (if near-real-time is needed), ClickHouse, object storage, dbt/ETL, BI, monitoring.
Budget: 6–18 million RUB for 6 months (team + implementation) + 200 thousand–1.5 million RUB/month for infrastructure/licenses (the range is determined by events, retention, SLA).
Large-scale scenario
Profile: enterprise, many channels and sources, requirements for real-time personalization, in-house event platform, a full-fledged Governance system and data security.
- Team: 8–20 FTE (data platform, ML, analytics, product, security), plus contractors.
- Timeline: 9–18 months to full maturity (with continuous development).
- Technologies: composable CDP/event platform, feature store, model serving, uplift experiments, data catalog, data contracts, DQ portal.
Budget: 40–150+ million RUB per year (highly dependent on scale). For reference: the “Magnit + Manzana” case mentions hundreds of specialists on the project and hundreds of virtual machines, which gives a qualitative sense of what “enterprise level” means.
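To make the ranges comparable, it helps to add the launch cost and the recurring cost over a fixed horizon. The sketch below does this arithmetic for the minimal and medium scenarios using the figures quoted above; the 12-month horizon for the minimal scenario and the function name `total_cost_range` are assumptions made for illustration, not part of the source estimates.

```python
def total_cost_range(launch, monthly, months):
    """Return (low, high) total cost in thousand RUB over `months`.

    `launch` and `monthly` are (low, high) tuples in thousand RUB.
    """
    low = launch[0] + monthly[0] * months
    high = launch[1] + monthly[1] * months
    return low, high


# Ranges copied from the scenario descriptions, in thousand RUB.
# Minimal: launch 1.5-4.5 million RUB, then 50-300 thousand RUB/month.
# The 12-month horizon is an assumption chosen for comparison.
minimal = total_cost_range(launch=(1_500, 4_500), monthly=(50, 300), months=12)

# Medium: 6-18 million RUB for 6 months, plus 200-1,500 thousand RUB/month.
medium = total_cost_range(launch=(6_000, 18_000), monthly=(200, 1_500), months=6)

print("minimal, first 12 months:", minimal)  # (2100, 8100)
print("medium, first 6 months:", medium)     # (7200, 27000)
```

In other words, the minimal scenario comes to roughly 2.1–8.1 million RUB in the first year, and the medium scenario to roughly 7.2–27 million RUB over its first six months, before any license costs that are quoted separately.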
KPI and performance metrics for AI web analytics systems
It is important to distinguish platform KPIs (how the system works) from business KPIs (what has changed).
| Level | KPI | How to measure | Target “norm” (benchmark) |
| --- | --- | --- | --- |
| Data | Event completeness | % of mandatory fields filled; share of valid events by schema | 98–99.9% (depends on the source) |
| Data | Data latency | time from event to appearance in a data mart/dashboard | minutes for operational use; hours for scheduled reporting |
| Data | Deduplication/duplicates | share of duplicates by event_id/rules | <0.1–1% |
| Data | Tracking coverage | share of critical user journeys covered by events | 80–95% for MVP, then higher |
| Analytics | Time-to-insight | time from question to answer (self-service) | hours instead of days |
| Analytics | Adoption | monthly active users (MAU) of analytics tools | growing metric |
| ML | Model quality | AUC/PR-AUC, calibration, stability | task-dependent, over time |
| ML | Drift | PSI/KS, metric drops, feature monitoring | thresholds + alerts |
| ML/Business | Uplift/incrementality | A/B, geo-experiments, uplift curves | positive uplift |
| Business | Conversion | CR across key funnels | growth vs baseline |
| Business | Retention | retention D7/D30/… | growth vs baseline |
| Business | Unit economics | LTV, CAC, LTV/CAC | improvement, but with margin taken into account |
| Business | Advertising efficiency | ROAS/CPA/ROMI | improvement under incrementality control |
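The data-level KPIs in the table can be computed directly from an event batch. The following is a minimal sketch in plain Python, assuming a hypothetical event schema (`event_id`, `event_name`, `client_id`, `event_ts`, `loaded_ts`); the field names and the sample batch are illustrative and should be replaced with your own tracking plan.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical list of mandatory fields; adapt to your own event schema.
REQUIRED_FIELDS = ("event_id", "event_name", "client_id", "event_ts")


def completeness(events):
    """Share of events in which every required field is present and non-empty."""
    if not events:
        return 0.0
    valid = sum(
        all(e.get(f) not in (None, "") for f in REQUIRED_FIELDS) for e in events
    )
    return valid / len(events)


def duplicate_share(events):
    """Share of events that repeat an event_id already present in the batch."""
    ids = [e["event_id"] for e in events if e.get("event_id")]
    if not ids:
        return 0.0
    return 1 - len(set(ids)) / len(ids)


def median_latency_seconds(events):
    """Median delay between event time and load time, in seconds."""
    delays = [
        (e["loaded_ts"] - e["event_ts"]).total_seconds()
        for e in events
        if e.get("event_ts") and e.get("loaded_ts")
    ]
    return median(delays) if delays else None


# Illustrative batch: one duplicate event_id and one empty client_id.
t0 = datetime(2025, 1, 1, 12, 0, 0)
events = [
    {"event_id": "a1", "event_name": "page_view", "client_id": "c1",
     "event_ts": t0, "loaded_ts": t0 + timedelta(seconds=60)},
    {"event_id": "a2", "event_name": "add_to_cart", "client_id": "c1",
     "event_ts": t0, "loaded_ts": t0 + timedelta(seconds=120)},
    {"event_id": "a2", "event_name": "add_to_cart", "client_id": "c1",
     "event_ts": t0, "loaded_ts": t0 + timedelta(seconds=180)},
    {"event_id": "a3", "event_name": "purchase", "client_id": "",
     "event_ts": t0, "loaded_ts": t0 + timedelta(seconds=240)},
]

print(f"completeness: {completeness(events):.2%}")
print(f"duplicates:   {duplicate_share(events):.2%}")
print(f"latency, s:   {median_latency_seconds(events)}")
```

In a real pipeline the same three numbers would normally be computed in the warehouse on a schedule and compared against the target “norm” column, with alerts when a threshold is crossed.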
How to tie KPI to money.
In international cases, “incremental” and operational metrics are often cited: for Transavia — incremental revenue and improvement in ROAS/CPA, reduction in licenses, and speed of use-case implementation; for Burberry — reduced data latency and improved personalization, which is linked to revenue growth. These examples are useful as a template for the structure of a KPI report.
Risks, law, and ethics
Russian data legislation: what specifically affects web analytics with AI
Disclaimer: below is an engineering and management analysis of the rules and their practical consequences; for a specific implementation, you need a personal data lawyer, because much depends on exactly what data you collect and how your contracts/processes are structured.
Personal data and the role of the “operator”
Federal Law 152-FZ defines:
- “personal data” — any information relating to an identified or identifiable natural person, directly or indirectly;
- “operator” — a legal entity/individual who organizes or carries out processing and determines the purposes/composition/operations;
- “processing” includes collection, recording, storage, transfer, anonymization, etc.;
- “cross-border transfer” — transfer of personal data to a foreign state/person/legal entity.
Practical takeaway for web analytics: if you collect identifiers through a website that can be linked to a person (especially when combined with CRM/orders), you are highly likely to become a personal data operator, and the analytics architecture must be designed as a personal data processing system.
Localization when collecting via the Internet
Part 5 of Article 18 of 152-FZ (the version in force after the 2025 amendments) explicitly states: when collecting personal data, including via the Internet, recording/systematizing/accumulating/storing/clarifying/retrieving personal data of Russian citizens using databases located outside the territory of the Russian Federation is not permitted, except for certain grounds (clauses 2, 3, 4, 8 of part 1 of Article 6).
Practical takeaway: if your web analytics writes data to a storage system outside Russia at the primary collection stage, this may be incompatible with the rule. In practice, businesses choose one of the following patterns:
- fully Russian environment (Russian cloud/on-prem),
- “primary recording in the Russian Federation → then cross-border transfer” (if it is lawful and properly documented),
- anonymization/minimization to a level at which the data cease to be personal data (difficult and requiring careful design, plus the risk of re-identification).
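The third pattern can be illustrated with a minimal sketch: keep only an allow-list of fields and replace the raw identifier with a keyed hash. This is an illustration under stated assumptions, not a compliance recipe. The field names, `ALLOWED_FIELDS`, and both helper functions are hypothetical. Note also that keyed hashing is pseudonymization rather than anonymization: whoever holds the key can re-link records, so the output may still qualify as personal data and needs a legal assessment.

```python
import hashlib
import hmac

# Hypothetical allow-list: only these fields leave the primary storage.
ALLOWED_FIELDS = {"event_name", "event_ts", "page_path"}


def pseudonymize_id(raw_id: str, secret_key: bytes) -> str:
    """Replace a raw identifier with a keyed HMAC-SHA256 digest.

    This is pseudonymization, not anonymization: the key holder can
    recompute the digest for a known identifier and re-link the record.
    """
    return hmac.new(secret_key, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize_event(event: dict, secret_key: bytes) -> dict:
    """Drop every field outside the allow-list and swap the raw ID for a digest."""
    slim = {k: v for k, v in event.items() if k in ALLOWED_FIELDS}
    slim["client_key"] = pseudonymize_id(event["client_id"], secret_key)
    return slim


# Illustrative event; in production the key would come from a secrets manager.
demo_key = b"demo-key-do-not-use"
raw_event = {
    "client_id": "c-123",
    "email": "user@example.com",
    "event_name": "purchase",
    "event_ts": "2025-01-01T12:00:00",
    "page_path": "/checkout",
}

print(minimize_event(raw_event, demo_key))
```

The design choice here is an allow-list rather than a deny-list: a new field added to the tracking plan is excluded by default until someone explicitly decides it may be exported.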
Notifications to Roskomnadzor and the register of operators
Article 22 of 152-FZ establishes the operator’s obligation, before starting processing, to notify the authorized body (RKN) of the intention to process personal data (there are exceptions that must be checked separately).
RKN provides electronic forms for submitting notifications and maintains a public register of personal data operators (official portal).
Practical takeaway: launching “end-to-end analytics with CRM” almost always involves personal data, so notification and correct documents (policy, consents, processor instructions) are not just “paperwork” but part of the project.
Cross-border transfer and notification
The official RKN portal states that a new procedure for notifying about the cross-border transfer of personal data has been in effect since March 1, 2023; the procedure provides for submitting a notification before such a transfer begins.
Practical takeaway: if you use foreign storage/processing services, this becomes a separate compliance track.
Liability and fines
Federal Law No. 420-FZ of 30.11.2024 amended the Code of Administrative Offenses of the Russian Federation and introduced new offenses/strengthened liability; the official text is posted on the legal information portal.
Legal practice reviews indicate that as of May 30, 2025, increased fines apply, including for failure to submit notifications and for data breaches, and a progressive scale is introduced for major breaches.
Practical takeaway for AI analytics: the more data you combine (web + CRM + IDs), the more expensive mistakes in security and incident response (incident handling, notifications, access logs) become.
Ethical and product risks of AI web analytics
Risk of “automated decisions” affecting rights
Article 16 of 152-FZ prohibits making decisions that produce legal consequences or affect the rights/interests of a data subject on the basis of exclusively automated processing, except in certain cases (including when written consent is available).
Practical takeaway: if you use ML scoring for decisions like “deny the service,” “change the price individually,” “restrict access,” you need to separately check the legal model and provide explanation/objection mechanisms.
Risks to model quality and controllability (business safety)
- Data leakage and “training on the future”: the model shows excellent metrics on the test set, but does not work in production.
- Concept drift: channels/assortment/UX have changed, and the model degrades.
- Simpson’s paradox: aggregates improve, while an important segment declines.
- Pseudo-causality: a propensity model identifies “who would buy anyway,” while marketing gives discounts to those who would have bought without them.
A reliable way to fight this is through uplift experiments and incrementality control (described in works on uplift/causal approaches).
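The pseudo-causality trap is easy to see in numbers. Below is a minimal sketch that computes uplift (treatment conversion rate minus control conversion rate) per segment from a randomized experiment. The segment names and all counts are made up for illustration, and `uplift_by_segment` is a hypothetical helper, not a reference to any specific library.

```python
def conversion_rate(users, conversions):
    """Conversions divided by users; 0.0 for an empty group."""
    return conversions / users if users else 0.0


def uplift_by_segment(results):
    """Compute per-segment uplift from randomized treatment/control groups.

    `results` maps segment -> {"treatment": (users, conversions),
                               "control": (users, conversions)}.
    """
    report = {}
    for segment, groups in results.items():
        cr_t = conversion_rate(*groups["treatment"])
        cr_c = conversion_rate(*groups["control"])
        report[segment] = {
            "cr_treatment": cr_t,
            "cr_control": cr_c,
            "uplift": cr_t - cr_c,
        }
    return report


# Illustrative (made-up) experiment: a discount offered to the treatment group.
experiment = {
    "high_propensity": {"treatment": (1000, 300), "control": (1000, 290)},
    "low_propensity": {"treatment": (1000, 80), "control": (1000, 40)},
}

report = uplift_by_segment(experiment)
for segment, row in report.items():
    print(
        f"{segment}: CR treated={row['cr_treatment']:.2%}, "
        f"CR control={row['cr_control']:.2%}, uplift={row['uplift']:+.2%}"
    )
```

In this toy data the high-propensity segment converts at 30% with the discount but at 29% without it, so the discount adds about one percentage point; the low-propensity segment converts far less often, yet the discount adds about four points. Targeting by propensity alone would spend the budget where it changes the least. A production analysis would also need confidence intervals and a check that the groups were truly randomized.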
Risk of “dark patterns” and goal substitution
AI analytics easily optimizes “proxy metrics” (clicks, time on site) instead of business value and trust. Therefore KPI should include not only metric growth, but also constraints: complaints, unsubscribes, refunds, NPS/CSAT, and alignment with user expectations. In the Galamart case, it is explicitly stated that irrelevant mailings led to lower open/click rates and an increase in unsubscribes and complaints; this is a good example of how “communication optimization” must include risk metrics.
Practical compliance and ethics checklist for implementation
- Data map: which events/fields are personal data/can become personal data after stitching. (Not specified — depends on your identification model.)
- Localization: where Russian citizens’ data are first written, how the DWH and backups are arranged.
- Documents and notices: RKN notification under Article 22, contracts with processors, access regulations.
- Cross-border transfer: notification under the RKN procedure (if applicable).
- Model management: monitoring quality/drift, logging decisions, rollback process.
- Experiments: uplift/incrementality as the standard for decision-making on promotions/personalization.
- Data minimization: store what is needed for the purposes, and for as long as needed (retention policy).
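The drift-monitoring item in the checklist can be made concrete with the Population Stability Index (PSI), which is also listed in the KPI table. The sketch below is one simple way to compute it in plain Python, assuming equal-width bins built from the baseline sample; the binning scheme, the `eps` floor for empty bins, and the sample data are assumptions made for illustration. Thresholds of roughly 0.1 and 0.25 are often quoted as rules of thumb, but they should be tuned per feature rather than taken as fixed.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample.

    Bins are equal-width over the baseline range; values outside that
    range are clamped into the first or last bin. Empty bins are floored
    at `eps` to keep the logarithm defined.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Illustrative data: a baseline feature and a shifted version of it.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]

print(f"PSI, same sample:    {psi(baseline, baseline):.4f}")
print(f"PSI, shifted sample: {psi(baseline, shifted):.4f}")
```

A value near zero means the current distribution matches the baseline; a large value, as with the shifted sample here, is a signal to investigate the feature and, if needed, trigger the rollback process from the checklist.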