Multiple instances for Asset Analytics
As a PI administrator I would like to be able to create multiple instances of the "PI Analysis Service" Windows service for these reasons:
- keep new, potentially problematic analyses from affecting the ones that are operational and working correctly, for example by setting up one production instance and one test instance
- give every instance its own log files, so that we could move some analyses to a new instance to troubleshoot them
- make it possible to distribute the resource usage for asset analytics over more than one server
This feature is heavily needed.
When building up central servers like AF you face the question of how to guarantee that random changes from one user do not affect the work of another user. For AF itself this can be done by using different AF Databases. But for Analysis we don't have this possibility right now.
A single user could kill the analysis service by rolling out a thousand calculations that are event-triggered (the default) instead of periodic. This is a considerable risk that some users are not willing to take. Therefore we need to maintain a separate AF and SQL Server for these users, increasing our cost and maintenance effort unnecessarily.
James Meade commented
Supported by my EA customers (Noble and ISO New England) as a needed feature.
Be aware of dependent calculations distributed across multiple Analysis Service servers for recalculation purposes.
Define a solution to support a Multi-Server Analysis Solution.
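One way to picture the dependency concern above: analyses that feed each other could be kept on the same server, and only independent groups spread across servers. The Python sketch below is a hypothetical illustration and not PI Analysis Service behavior; the function name, the shape of the dependency map, and the analysis names are all assumptions made for the example.

```python
def partition_by_dependency(dependencies, server_count):
    """Group analyses linked by input dependencies, then spread the groups over servers.

    dependencies maps an analysis name to the names of the analyses it reads from.
    All names here are hypothetical.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge every analysis with each of its inputs into one group.
    for analysis, inputs in dependencies.items():
        find(analysis)
        for inp in inputs:
            parent[find(inp)] = find(analysis)

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), []).append(name)

    servers = [[] for _ in range(server_count)]
    # Largest groups first, each placed on the currently least-loaded server.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(servers, key=len).extend(sorted(members))
    return servers


deps = {"B": ["A"], "C": ["B"], "E": ["D"], "F": []}
servers = partition_by_dependency(deps, server_count=2)
```

With this toy input, the chain A, B, C stays together on one server, so a recalculation of C never has to wait on an input computed elsewhere.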
Thanks for your feedback. We're looking at your Usecase1 right now via tech support escalation. As to Usecase3, the upcoming release should help you if your CPU is not being fully utilized. We'll reach out to you via your existing tech support case (Usecase1).
I'm not sure multiple instances is the right solve, but up-voting all the same due to use-cases presented.
Usecase1: “Avoid new potentially problematic analysis from affecting others”
Disturbing existing production analysis while committing template/analysis changes is a real problem, and even more so when testing the deployment of new functionality that must be done against the production database.
The typical case during a template edit is that all dependent analyses and event frames restart and ALL event processing halts for 10-30 mins, which is not very favorable.
The worst case is when all or nearly all analyses, regardless of any dependencies, get into a perpetual restart or fail to start calculating again even though they show as started, forcing us to restart the analysis service.
I feel we are being hyper-nice in our analysis architecture, yet the smallest of daily changes can cause immediate delays on all production analyses and notifications, and even cascade into a full failure. Below is some of the niceness:
- 250k’ish analyses running on AF Analysis Service under 2018 SP1 on dedicated analysis servers
- Extremely robust servers with large calc threads and data pipes, with deep events cache and eval queue
- 100% periodic scheduled analysis (no event-base) with 10 second offsets between each layer of input dependencies, and 2 sec calculation wait time
- All analysis outputs are PI Points if used as input dependencies to other analysis
- 0% analysis are in error
- 100% template driven analysis
- Very cautious input/data validation and protection using badval(), FindEq(), FindNE(), and TagAvg() to reduce possible “Calc Failed” issues
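The 10 second offset between layers of input dependencies mentioned in the list above can be derived mechanically from a dependency map. The following is only a sketch under assumptions: the function name, the dictionary layout, and the analysis names are hypothetical, and this is not how PI Analysis Service itself assigns schedules.

```python
def layer_offsets(dependencies, step_seconds=10):
    """Return {analysis: offset_seconds}, one step later than the deepest input.

    dependencies maps an analysis name to the names of the analyses it reads from.
    Names that only appear as inputs are treated as layer 0.
    """
    memo = {}

    def depth(name, stack=()):
        if name in stack:
            raise ValueError(f"dependency cycle at {name}")
        if name not in memo:
            inputs = dependencies.get(name, [])
            memo[name] = (
                0 if not inputs
                else 1 + max(depth(i, stack + (name,)) for i in inputs)
            )
        return memo[name]

    names = set(dependencies) | {i for ins in dependencies.values() for i in ins}
    return {n: depth(n) * step_seconds for n in sorted(names)}


# Hypothetical three-layer rollup: pumps -> unit -> site.
example = {
    "UnitTotal": ["PumpA_Flow", "PumpB_Flow"],
    "SiteTotal": ["UnitTotal"],
}
offsets = layer_offsets(example)
```

Here the pump-level analyses get offset 0, `UnitTotal` gets 10 seconds, and `SiteTotal` gets 20 seconds, so each layer runs only after its inputs have had a full step to write their outputs.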
Usecase2: “every instance could have its own log files”
Troubleshooting with high levels of TRACE logging turned on causes instability in analysis, which compounds whatever issue we are trying to troubleshoot. Being able to enable template- or element-specific logging levels would allow for stable troubleshooting and much quicker human evaluation of the logs, as well as reduced time FTP’ing logs back to OSIsoft.
Right now troubleshooting open issues with analysis is painful, as we cannot leave logging on TRACE levels and ‘hope’ to encounter the issue through daily use (as it was encountered in the first place), given the instability and the size of the log files generated. This forces us to try to replicate issues by forcing changes or conditions on a production database that we anticipate will trigger them while logging is on TRACE levels. Focused logging for a specific element or template type would allow us to leave logging on high levels and find issues without guessing at or forcing conditions on the database to replicate them.
Usecase3: “make possible to distribute the resources”
Further increasing evaluation and writing threads has a negative effect, yet our analysis servers are bored (<15% AVG utilization w/ 5% STDEV) and the PI servers see even lower use. However, even after a very quick stop/start of the analysis services, which happens frequently, backfilling still takes 2-3 hrs. So we are bottlenecked somewhere, yet there are a lot more CPU, memory, and network resources to throw at the problem, in addition to secondary servers across which the load could be further distributed.
Stumbled upon this again. We would need multiple instances to properly manage larger systems. Right now the only way to do that is to separate into multiple AF Servers.
It would already be a great help if you could tie a single Analysis service to a single AF database instead of to an AF Server. That could be used for functional segmentation without the hassle of also segmenting the entire AF Server.
In response to Wilson Laizo Filho, "We had before, but we shouldn't have any..."
I started writing a long reply back to you but then I realized that we may not be talking about the same thing. Is your startup time slow, meaning when you restart PI Analysis Service? Or do you mean something else like the rollup analysis takes longer to complete than what you would like?
In response to Stephen Kwan, "Hi Wilson, Thanks for the info. I'm c..."
We had before, but we shouldn't have anymore. We changed the process and every time a new asset is created we Create and Update all PI Point data References and only after that we exclude the attributes we are not using.
We may have PI Points without any values or with a System Digital State, but all PI Points should have been created.
In response to Wilson Laizo Filho, "Hi Steve, in my case I'm using 2017R2. I..."
Thanks for the info. I'm curious if you have missing PI Points in your rollups?
In response to Stephen Kwan, "Hi Wilson, which version are you using? ..."
Hi Steve, in my case I'm using 2017R2. I think I already mentioned before but our case is the sheer amount of calculations that is the issue. We have few data flowing, but a lot of calculations and they are mostly rollups.
In fact, in our case one big help would be if we could select the triggers for rollups. Right now we may have 100 triggers for a single rollup, but most of the data comes in at the same time, so if we could choose only 5 or so to be triggers it would start faster than it does now. We are using some "tricks" like excluding attributes that we are not using (not all attributes have values in our case, but they all have tags), but it still takes some time to start.
And in our case our DA is OK and our AF is also OK and not reaching any limit right now.
We already had someone from OSI to come locally and we implemented all the suggestions and it improved a lot but it would still benefit a lot from some sort of Load Balance or Split load between servers. Like I said before, for us a way to link more than one Analysis to the same AF server would be probably the ideal scenario right now.
In response to Wilson Laizo Filho, "Hi Steve, of course I don't know the int..."
Hi Wilson, which version are you using? The 2017 R2 and 2018 releases both have a lot of improvements to: 1) help identify the root cause of lags and 2) remove several bottlenecks. In addition, the 2018 release has a nifty feature that when you do a Preview (and also Evaluate) PI System Explorer gives you additional information on how long an execution took.
I'm not disputing that scaling out PI Analysis Service would help in some instances, but I want to make sure we understand the real reason behind any lag. If signups are slow, then is it possible the PI Data Archive is overworked? In that case, multiple instances of PI Analysis Service may cause additional issues for your PI System as a whole. Oftentimes, what we find is that users can remove a lot of lag by ensuring they don't unnecessarily tax their PI System. For example, do you have missing PI Points in your inputs such that you're causing excess round trips to query for a PI Point? Are you doing year-to-date summaries with every execution cycle? Are you repeatedly doing a sub-calculation when you could do it just once and reference it as a "Variable"?
Not trying to avoid the problem, but rather I'm interested in making sure we solve the right problem.
In response to Stephen Kwan, "If by adding RAM and CPU provide no sign..."
Hi Steve, of course I don't know the internal code for PI Analysis, but for example if PI Analysis limits the number of simultaneous calls to a PI server, increasing RAM or CPU would not help necessarily.
For example, in our case RAM only helps performance up to the point where we are not using 100% of the memory. Beyond that, adding memory does not help, and we see that our PI Analysis server takes a while to do all the subscriptions (on startup in this example). In our case, if multiple servers were subscribing independently, we should see the overall "subscription process" being faster. And I believe the same would happen with the data being processed by the calculations.
Also, in the same example above, our Analysis process doesn't go over 30%, so adding more CPUs would not help either.
In response to Lal Babu Shaik, "We have configured all calculations in A..."
If adding RAM and CPU provides no significant change, then it's likely that the bottleneck is not with PI Analysis Service. If that is the case, I don't think that multiple instances of PI Analysis Service would necessarily remove the bottleneck. Do you know why your calculations are lagging by 12 hours? Do you have a very large dependency chain such that you're being blocked by an expensive analysis? Are you doing large summaries that are taking a long time? Are you triggering faster than the end-to-end calculation time of your analyses? I would suggest that you call tech support so they can help you pinpoint the problem.
We have configured all calculations in AF analysis, moving from ACE. As the number of calculations increases (including Max and Min over a week), analysis is unable to calculate on time, and performance is degraded when data comes in at intervals between 1 sec and 60 sec. Analysis is 12 hours behind with approx. 15k calculations. We added RAM and CPU but saw no drastic change. Load balancing across multiple AF analytics servers would give customers the right data instead of making them wait 12 hours to know what exactly has happened. Waiting for an update at the earliest.
We need this asap
I agree here. We need to have load balancing on the services level.
The size of our system is always growing, and so is the requirement for more analytics. We have over 400k analytics in a single AF database! This makes it really hard to back-fill the analytics without affecting real-time analytic calculations. It would be great if we could spread the load across multiple analytic servers. One would process recalculation requests and the other would process the real-time analytics calculations. That is just one example. I am sure there are other use cases.
In response to John Messinger, "This is most definitely needed. I have a..."
I do think John's idea is great. In fact, we were expecting it as a first step before a true HA setup, meaning the ability to separate the analyses onto different servers. I hadn't thought of the group idea before, but I do think it would be great. Nowadays we kind of do that using Categories, but with the number of analyses we have it gets really slow overall.
And I understand that OSIsoft implemented HA with Windows Cluster, but that is really not a viable solution for us as our experience with it (in other cases) was not satisfactory as it would change the nodes unexpectedly and that would make our system be unresponsive for a while (given the time it takes to restart the services).
This is most definitely needed. I have a few customers here where this type of load balancing would make a significant difference to analysis performance.
As a variation to this idea, what about the option of Analysis pools, where a group of Analyses could be configured to run in a higher or lower priority pool, or even on a separate Analysis server? Even being able to target a specific AF database on a given Analysis server - when you have a need to restart the Analysis service for the calculations in a given database, you don't necessarily want analyses in other AF databases to be affected.
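To make the pool idea above concrete, here is a minimal sketch of routing analyses to pools by AF database and priority, so that restarting one pool leaves the others untouched. Everything in it is an assumption made for illustration: the `Analysis` class, the routing table, and the host names are hypothetical, and no such configuration exists in PI Analysis Service as far as this thread indicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    name: str
    database: str
    priority: str = "normal"


def build_pools(analyses, routing, default_host="analysis-default"):
    """Map each analysis to a host using (database, priority) routing rules.

    A rule keyed (database, None) applies to every priority in that database.
    """
    pools = {}
    for a in analyses:
        host = routing.get(
            (a.database, a.priority),
            routing.get((a.database, None), default_host),
        )
        pools.setdefault(host, []).append(a.name)
    return pools


def affected_by_restart(pools, host):
    """Only analyses in the restarted pool are interrupted."""
    return pools.get(host, [])


# Hypothetical databases and hosts.
analyses = [
    Analysis("PumpEfficiency", "Production", "high"),
    Analysis("DailyRollup", "Production"),
    Analysis("NewTemplateTest", "Test"),
]
routing = {
    ("Production", "high"): "analysis-01",
    ("Production", None): "analysis-02",
    ("Test", None): "analysis-test",
}
pools = build_pools(analyses, routing)
```

In this toy setup, `affected_by_restart(pools, "analysis-test")` returns only `["NewTemplateTest"]`, which is the isolation between test and production that the original request asks for.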