Sorry for the long post, but I am hoping I have included everything necessary to get this situation resolved.
We are running Cacti 0.8.8h, Spine 0.8.8h, PHP 5.6.29, NET-SNMP 5.5, and RRDTool 1.4.8, and everything else on the Technical Support page looks normal. We had a spare server that we migrated Cacti to, following the steps here: http://xmodulo.com/migrate-cacti-server.html. The server is heavily overpowered (24 cores and 140 GB of RAM) and shows only 78% CPU utilization and 3.6 GB of RAM used. We are polling 476 hosts with 17,788 data sources, yet even with the settings below, polling usually takes around 58 seconds and rarely drops below 39 seconds.
[Attachment: poller.jpg]
In addition, we keep seeing log messages like "SPINE: Poller ERROR: Spine Timed Out While Processing Hosts Internal" and "SPINE: Poller ERROR: SS PHP Script Server communications lost. Restarting PHP Script Server". Every fix we found online boiled down to either raising max_connections in my.cnf (which we did, to no avail) or checking for two pollers running. We dug and dug, and there is only one poller anywhere: the cactiuser crontab entry, which looks like:

* * * * * umask 000; /usr/bin/php /usr/local/cacti/poller.php > /dev/null 2>&1
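For anyone wanting to rule out duplicate pollers the same way, a quick check is to scan every user crontab plus the system-wide cron directories (paths assume a default install; adjust as needed):

```shell
# Count crontab entries invoking poller.php across all users.
# Anything above 1 means a duplicate poller is scheduled somewhere.
count=$(for u in $(cut -d: -f1 /etc/passwd); do
    crontab -l -u "$u" 2>/dev/null
done | grep -c 'poller\.php' || true)
echo "user crontab entries invoking poller.php: $count"

# Also check the system-wide cron locations:
grep -r 'poller\.php' /etc/cron* 2>/dev/null || true
```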
In the my.cnf file, we have max_connections = 1000 and max-heap-table-size = 1G. According to http://www.cacti.net/downloads/docs/html/using_spine.html, "Maximum Concurrent Poller Processes" should be 1-2 times the number of cores, but setting it to 24 or 48 definitely didn't get us under one minute. And despite that page saying the recommended "Maximum Threads per Process" is 5-10, the only way we could get under a minute was setting it to 48 (2x the number of processors), as shown in the image below.
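For what it's worth, the back-of-envelope math we used when juggling those two settings: spine splits the host list across processes, and each process splits its share across threads, so the numbers that matter are hosts per process and hosts per thread. A rough sketch (the 24/48 values are just our settings, and spine's actual scheduling is dynamic, so this is only an approximation):

```shell
hosts=476
processes=24   # "Maximum Concurrent Poller Processes" (our setting)
threads=48     # "Maximum Threads per Process" (our setting)

# Ceiling division: roughly how many hosts each process, and then
# each thread, is responsible for per polling cycle.
per_process=$(( (hosts + processes - 1) / processes ))
per_thread=$(( (per_process + threads - 1) / threads ))
echo "~$per_process hosts per process, ~$per_thread per thread"
```

At roughly one host per thread, a single slow or timing-out host can pin a whole thread for its full SNMP timeout, which made us start suspecting a handful of slow hosts rather than CPU.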
[Attachment: settings.jpg]
Running the repair_database script returned the output below. Could those issues be the cause of the slowness? And if so, what are the possible risks of running the command with "--force"?
NOTE: Checking for Invalid Cacti Templates
NOTE: 1 Invalid CDEF Item Rows Found in Graph Templates
NOTE: 45950 Invalid Data Input Data Rows Found in Data Templates
WARNING: Cacti Template Problems found in your Database. Using the '--force' option will remove
the invalid records. However, these changes can be catastrophic to existing data sources. Therefore, you
should contact your support organization prior to proceeding with that repair.
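In case it helps anyone weighing the same --force decision: the safe pattern seems to be dumping the cacti database first so the purge can be rolled back. A minimal sketch, assuming the default database name, user, and install path (wrapped in a function so nothing runs until it is called):

```shell
# Hypothetical helper: back up the cacti DB, then run the forced repair.
# Database name, credentials, and paths are assumptions from a default
# source install -- adjust before use.
backup_and_repair() {
    mysqldump -u cactiuser -p cacti > "cacti_backup_$(date +%F).sql" || return 1
    php /usr/local/cacti/cli/repair_database.php --force
}
# Invoke manually once the backup location has been verified:
# backup_and_repair
```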
As you may have seen in the image above, we have Boost and hmib installed. We could never get hmib working; it would never populate the RRDs. We had Boost running on the old system, but polling still didn't finish in under a minute. On this new system, when we try to enable Boost, it doesn't process any records: it shows the next time records will be processed, that time comes and goes with the Boost status showing no records processed, and the next scheduled processing time just bumps forward 30 minutes.
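A quick way to confirm whether Boost is logging anything at all when the processing time passes is to grep the Cacti log (the log path below assumes a source install under /usr/local/cacti):

```shell
# Pull the last few boost-related log lines, if any exist.
log=/usr/local/cacti/log/cacti.log
msg=$(grep -i 'boost' "$log" 2>/dev/null | tail -n 20 || true)
[ -n "$msg" ] && echo "$msg" || echo "no boost entries found in $log"
```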
Are we really asking for too much? Is 40-58 seconds the best we can reasonably get with this setup? Any help would be greatly appreciated.