Cacti: Spine 1.1.38 unable to handle a large number of disabled devices

Rno
Cacti User
Posts: 325
Joined: Wed Dec 07, 2011 9:19 am

Cacti: Spine 1.1.38 unable to handle a large number of disabled devices

#1 Post by Rno » Mon Jun 25, 2018 5:54 am

I have a problem with my install; is it a Cacti misconfiguration or a MariaDB one?
I have 997 enabled devices, which amounts to DataSources: 8618 and RRDsProcessed: 4796.

It works almost correctly.

But I'm also using Cacti to manage some endpoints, like IP phones. I don't do anything with them; they are in a disabled state.
So far I have 5113 disabled devices.
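
For reference, a quick way to confirm the enabled/disabled split is to query the host table directly (a minimal sketch, assuming the stock Cacti schema where host.disabled = 'on' marks a disabled device):

-- count Cacti devices by state; in the stock schema 'on' means disabled
SELECT IF(disabled = 'on', 'disabled', 'enabled') AS state,
       COUNT(*) AS devices
FROM host
GROUP BY state;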

But with that configuration I get plenty of errors from spine, like these:
2018/06/25 11:17:36 - SYSTEM THOLD STATS: CPUTime:0 MaxRuntime:0 Tholds:0 TotalDevices:997 DownDevices:4 NewDownDevices:0 Processes: 0 completed, 0 running, 0 broken
2018/06/25 11:17:02 - SPINE: Poller[Main Poller] ERROR: Spine Timed Out While Waiting for Threads to End
2018/06/25 11:17:01 - SPINE: Poller[Main Poller] ERROR: Spine Timed Out While Waiting for Threads to End
2018/06/25 11:17:01 - POLLER: Poller[Main Poller] WARNING: There are '2' detected as overrunning a polling process, please investigate
2018/06/25 11:17:00 - SPINE: Poller[Main Poller] ERROR: Spine Timed Out While Processing Devices Internal
2018/06/25 11:17:00 - SPINE: Poller[Main Poller] ERROR: Spine Timed Out While Processing Devices Internal
2018/06/25 11:17:00 - SYSTEM STATS: Time:59.0163 Method:spine Processes:3 Threads:5 Hosts:997 HostsPerProcess:333 DataSources:8781 RRDsProcessed:3902
2018/06/25 11:17:00 - POLLER: Poller[Main Poller] Maximum runtime of 58 seconds exceeded. Exiting.
2018/06/25 11:16:03 - SPINE: Poller[Main Poller] ERROR: Spine Timed Out While Waiting for Threads to End
2018/06/25 11:16:03 - SPINE: Poller[Main Poller] ERROR: Spine Timed Out While Waiting for Threads to End
2018/06/25 11:16:02 - SPINE: Poller[Main Poller] ERROR: Spine Timed Out While Processing Devices Internal
2018/06/25 11:16:01 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 9, Graphs[se-ssi-13 - Traffic - Gi1/0/28 - (128 SE-SSI-14 / 2960S-24P / Int G1/0/25 ---), se-ssi-13 - Traffic - Gi1/0/28 - (128 SE-SSI-14 / 2960S-24P / Int G1/0/25 ---)] Graphs[se-ssi-13 Gi1/0/26 - Status - 126 SE-SSI-16 / 3560-24P / Int G0/1 --- , se-ssi-13 Gi1/0/26 - Status - 126 SE-SSI-16 / 3560-24P / Int G0/1 --- ] Graphs[se-ssi-13 - Traffic - Gi1/0/26 - (126 SE-SSI-16 / 3560-24P / Int G0/1 ---), se-ssi-13 - Traffic - Gi1/0/26 - (126 SE-SSI-16 / 3560-24P / Int G0/1 ---)] Graphs[se-ssi-13 - Traffic - Gi1/0/28 - (128 SE-SSI-14 / 2960S-24P / Int G1/0/25 ---), se-ssi-13 - Traffic - Gi1/0/28 - (128 SE-SSI-14 / 2960S-24P / Int G1/0/25 ---)] Graphs[se-ssi-13 - Traffic - Gi1/0/25 - (125 SE-SSI-12 / 2960S-24P / Int G1/0/28 ---), se-ssi-13 - Traffic - Gi1/0/25 - (125 SE-SSI-12 / 2960S-24P / Int G1/0/28 ---)] Graphs[se-ssi-13 Gi1/0/25 - Status - 125 SE-SSI-12 / 2960S-24P / Int G1/0/28 --- , se-ssi-13 Gi1/0/25 - Status - 125 SE-SSI-12 / 2960S-24P / Int G1/0/28 --- ] Graphs[se-ssi-13 Gi1/0/28 - Status - 128 SE-SSI-14 / 2960S-24P / Int G1/0/25 --- , se-ssi-13 Gi1/0/28 - Status - 128 SE-SSI-14 / 2960S-24P / Int G1/0/25 --- ] Graphs[se-ssi-13 - Traffic - Gi1/0/26 - (126 SE-SSI-16 / 3560-24P / Int G0/1 ---), se-ssi-13 - Traffic - Gi1/0/26 - (126 SE-SSI-16 / 3560-24P / Int G0/1 ---)] Graphs[se-ssi-13 - Traffic - Gi1/0/25 - (125 SE-SSI-12 / 2960S-24P / Int G1/0/28 ---), se-ssi-13 - Traffic - Gi1/0/25 - (125 SE-SSI-12 / 2960S-24P / Int G1/0/28 ---)] DS[se-ssi-13 - Traffic - Gi1/0/28 128 SE-SSI-14 / 2960S-24P / Int G1/0/25 ---, se-ssi-13 Gi1/0/26 - Status - 126 SE-SSI-16 / 3560-24P / Int G0/1 --- , se-ssi-13 - Traffic - Gi1/0/26 126 SE-SSI-16 / 3560-24P / Int G0/1 ---, se-ssi-13 - Traffic - Gi1/0/28 128 SE-SSI-14 / 2960S-24P / Int G1/0/25 ---, se-ssi-13 - Traffic - Gi1/0/25 125 SE-SSI-12 / 2960S-24P / Int G1/0/28 ---, se-ssi-13 Gi1/0/25 - Status - 125 SE-SSI-12 / 2960S-24P / Int G1/0/28 --- , se-ssi-13 Gi1/0/28 - Status - 128 SE-SSI-14 / 2960S-24P / Int G1/0/25 --- , se-ssi-13 - Traffic - Gi1/0/26 126 SE-SSI-16 / 3560-24P / Int G0/1 ---, se-ssi-13 - Traffic - Gi1/0/25 125 SE-SSI-12 / 2960S-24P / Int G1/0/28 ---]
2018/06/25 11:16:01 - POLLER: Poller[Main Poller] WARNING: There are '2' detected as overrunning a polling process, please investigate
2018/06/25 11:16:00 - SYSTEM STATS: Time:59.1470 Method:spine Processes:3 Threads:5 Hosts:997 HostsPerProcess:333 DataSources:8391 RRDsProcessed:4491


And after simply removing those 5113 devices, I get:
2018/06/25 12:48:30 - SYSTEM STATS: Time:28.2434 Method:spine Processes:3 Threads:5 Hosts:997 HostsPerProcess:333 DataSources:8618 RRDsProcessed:4796
2018/06/25 12:47:36 - SYSTEM THOLD STATS: CPUTime:0 MaxRuntime:0 Tholds:0 TotalDevices:997 DownDevices:3 NewDownDevices:0 Processes: 0 completed, 0 running, 0 broken


Any clue how to fix this?
CentOS
Production
Cacti 0.8.8h
Spine 0.8.8h
PIA 3.1
Aggregate 0.75
Monitor 1.3
Settings 0.71
Weathermap 0.98
Thold 0.5
rrdclean 0.41

Own plugin: LinkDiscovery 0.3, Map 0.4

Test
Cacti 1.2.1
Spine 1.2.1
thold 1.0.6
monitor 2.3.5
php 7.2.11
mariadb 5.5.56
Own plugin:
ExtendDB 1.1.2
LinkDiscovery 1.2.4
Map 1.2.5

netniV
Cacti Guru User
Posts: 2682
Joined: Sun Aug 27, 2017 12:05 am

Re: Cacti: Spine 1.1.38 unable to handle large disabled devi

#2 Post by netniV » Mon Jun 25, 2018 10:32 am

So the bits that stand out to me are:
SYSTEM STATS: Time:28.2434 Method:spine Processes:3 Threads:5 Hosts:997 HostsPerProcess:333 DataSources:8618 RRDsProcessed:4796
vs
SYSTEM STATS: Time:59.0163 Method:spine Processes:3 Threads:5 Hosts:997 HostsPerProcess:333 DataSources:8781 RRDsProcessed:3902

The first thing is, you said you removed 1000+ devices when you only have 997 in total. The second is, you should increase your thread count for spine. I have mine at 30, which works OK for me, so you should try increasing that. It is basically the number of devices being polled at the same time within that process. So, if 5 consecutive hosts are timing out, it has to wait for all five to time out before it moves on to the next one.
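
If you would rather change it from the database than from the Poller settings page in the Console, something along these lines should do it (a sketch only; it assumes the stock settings keys, so double-check the names on your install):

-- raise spine's thread count (stored in Cacti's settings table)
UPDATE settings SET value = '30' WHERE name = 'max_threads';
-- and review the related poller values
SELECT name, value FROM settings
WHERE name IN ('max_threads', 'concurrent_processes', 'poller_type');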

Rno
Cacti User
Posts: 325
Joined: Wed Dec 07, 2011 9:19 am

Re: Cacti: Spine 1.1.38 unable to handle large disabled devi

#3 Post by Rno » Tue Jun 26, 2018 12:05 am

Well, I have 5113 devices in Cacti and only 997 are active; that's what I said. The others are disabled and used as inventory!

On the second point, I will try with 30 threads, but the last time I did that kind of testing it caused problems, since all the processes try to update a single table, so the bottleneck was the database.
That's why I tried to go the other way around and find the smallest thread count that avoids database lock issues.
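
For what it's worth, when I tested higher thread counts I mostly watched the poller output table for contention; this is roughly what I ran (just an illustration for MariaDB/InnoDB, assuming the stock table names):

-- rows left over between cycles mean the poller is not keeping up
SELECT COUNT(*) FROM poller_output;
-- look for lock waits while a polling cycle is running
SELECT * FROM information_schema.INNODB_LOCK_WAITS;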

And by the way, the design of large-scale Cacti is a known problem (https://github.com/Cacti/cacti/issues/1060).
CentOS
Production
Cacti 0.8.8h
Spine 0.8.8h
PIA 3.1
Aggregate 0.75
Monitor 1.3
Settings 0.71
Weathermap 0.98
Thold 0.5
rrdclean 0.41

Own plugin: LinkDiscovery 0.3, Map 0.4

Test
Cacti 1.2.1
Spine 1.2.1
thold 1.0.6
monitor 2.3.5
php 7.2.11
mariadb 5.5.56
Own plugin:
ExtendDB 1.1.2
LinkDiscovery 1.2.4
Map 1.2.5

netniV
Cacti Guru User
Posts: 2682
Joined: Sun Aug 27, 2017 12:05 am

Re: Cacti: Spine 1.1.38 unable to handle large disabled devi

#4 Post by netniV » Tue Jun 26, 2018 3:18 am

That design problem applies more when working with multiple remote pollers, since the remote pollers actually try to update the main database. The changes are designed to make the pollers use their own local database and have the main server query the DB on the poller as needed (if I remember correctly).

I still think that if you implement the three things I suggested, you should see improvements. At the very least, try them out and post the SYSTEM STATS messages from before and after so we can see the results.

eholz1
Cacti User
Posts: 129
Joined: Mon Oct 01, 2018 10:09 am

Re: Cacti: Spine 1.1.38 unable to handle large disabled devi

#5 Post by eholz1 » Wed Jan 30, 2019 3:07 pm

Hello,
I have the same problem here; I've searched high and low with no real fix. I constantly get "Poller Output Table not Empty" warnings.
I also have Juniper MX-80s that ALWAYS give SNMP timeouts and gaps in the graphs.
I have 60 disabled devices and 388 enabled devices; most are routers and switches. Does anyone have a clue how these warnings can be minimized or eliminated? (The checks and cleanup I've been trying are after the log excerpt below.) Using Cacti 1.1.38, Spine 1.1.38, the Cacti cron job, MySQL 5.3.3, PHP 5.3, Red Hat 6.10.

I have attached a view of the warnings in the log:
2019/01/30 20:01:40 - SYSTEM THOLD STATS: Time:0.0280 Tholds:2 TotalDevices:388 DownDevices:42 NewDownDevices:0
2019/01/30 20:01:39 - SYSTEM STATS: Time:11.3700 Method:spine Processes:2 Threads:30 Hosts:388 HostsPerProcess:194 DataSources:2510 RRDsProcessed:1285
2019/01/30 20:01:28 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 17, Graphs[stl-73-307-rsw - Traffic 30sec - Te1/13 , stl-73-307-rsw - Traffic 30sec - Te1/13 ] Graphs[stl-73-307-rsw - Traffic 30sec - Te1/14 , stl-73-307-rsw - Traffic 30sec - Te1/14 ] Graphs[sea-2-25-asw8 - Traffic - Gi1/1/4, sea-2-25-asw8 - Traffic - Gi1/1/4] Graphs[sea-2-25-asw8 - Traffic - Gi1/1/4, sea-2-25-asw8 - Traffic - Gi1/1/4] Graphs[|host_description| - Traffic 30sec - |query_ifName| , |host_description| - Traffic 30sec - |query_ifName| ] Graphs[|host_description| - Traffic 30sec - |query_ifName| , |host_description| - Traffic 30sec - |query_ifName| ] Graphs[msa-50-531-rtr3 - Traffic - Gi0/1, msa-50-531-rtr3 - Traffic - Gi0/1] Graphs[msa-50-531-rtr3 - Traffic - Gi0/1, msa-50-531-rtr3 - Traffic - Gi0/1] Graphs[sea-2-31-asw7 - Traffic - Gi1/1/4, sea-2-31-asw7 - Traffic - Gi1/1/4] Graphs[sea-2-31-asw7 - Traffic - Gi1/1/4, sea-2-31-asw7 - Traffic - Gi1/1/4] Graphs[evt-40-40-asw2 - Traffic - Gi1/1, evt-40-40-asw2 - Traffic - Gi1/1] Graphs[evt-40-40-asw2 - Traffic - Gi1/1, evt-40-40-asw2 - Traffic - Gi1/1] Graphs[kor-99-d35-esw1 - Traffic - Te1/0/23, kor-99-d35-esw1 - Traffic - Te1/0/23] Graphs[kor-99-d35-esw1 - Traffic - Te1/0/23, kor-99-d35-esw1 - Traffic - Te1/0/23] Graphs[kor-99-d35-esw1 - Traffic - Te1/0/4, kor-99-d35-esw1 - Traffic - Te1/0/4] Graphs[kor-99-d35-esw1 - Traffic - Te1/0/4, kor-99-d35-esw1 - Traffic - Te1/0/4] Graphs[sbc-52-880-steven - Traffic 30sec - Vl11 , sbc-52-880-steven - Traffic 30sec - Vl11 ] DS[stl-73-307-rsw - Traffic 30sec - Te1/13, stl-73-307-rsw - Traffic 30sec - Te1/14, sea-2-25-asw8 - Traffic - Gi1/1/4, sea-2-25-asw8 - Traffic - Gi1/1/4, sea-9-53-rsw - Traffic 30sec - Te1/18, sea-9-53-rsw - Traffic 30sec - Te1/19, msa-50-531-rtr3 - Traffic - Gi0/1, msa-50-531-rtr3 - Traffic - Gi0/1, sea-2-31-asw7 - Traffic - Gi1/1/4, sea-2-31-asw7 - Traffic - Gi1/1/4, evt-40-40-asw2 - Traffic - |query_ifName|, evt-40-40-asw2 - Traffic - |query_ifName|, kor-99-d35-esw1 - Traffic - Te1/0/23, kor-99-d35-esw1 - Traffic - Te1/0/23, kor-99-d35-esw1 - Traffic - Te1/0/4, kor-99-d35-esw1 - Traffic - Te1/0/4, sbc-52-880-steven - Traffic 30sec - 10.48.11.7 - Vl11 ]
2019/01/30 20:01:11 - SYSTEM THOLD STATS: Time:0.0261 Tholds:2 TotalDevices:388 DownDevices:42 NewDownDevices:0
2019/01/30 20:01:10 - SYSTEM STATS: Time:11.3747 Method:spine Processes:2 Threads:30 Hosts:388 HostsPerProcess:194 DataSources:2510 RRDsProcessed:1298
2019/01/30 20:00:59 - POLLER: Poller[Main Poller] WARNING: Poller Output Table not Empty. Issues: 20, Graphs[stl-73-307-rsw - Traffic 30sec - Te1/16 , stl-73-307-rsw - Traffic 30sec - Te1/16 ] Graphs[stl-73-307-rsw - Traffic 30sec - Te1/15 , stl-73-307-rsw - Traffic 30sec - Te1/15 ] Graphs[sbc-52-880-steven - Traffic 30sec - Vl11 , sbc-52-880-steven - Traffic 30sec - Vl11 ] Graphs[hbc-37-22-asw16 - Traffic - Gi1/1/4, hbc-37-22-asw16 - Traffic - Gi1/1/4] Graphs[hbc-37-22-asw16 - Traffic - Gi1/1/4, hbc-37-22-asw16 - Traffic - Gi1/1/4] Graphs[|host_description| - Traffic 30sec - |query_ifName| , |host_description| - Traffic 30sec - |query_ifName| ] Graphs[evt-40-40-asw2 - Traffic - Gi0/1, evt-40-40-asw2 - Traffic - Gi0/1] Graphs[evt-40-40-asw2 - Traffic - Gi0/1, evt-40-40-asw2 - Traffic - Gi0/1] Graphs[kor-99-d35-esw1 - Traffic - Te1/0/22, kor-99-d35-esw1 - Traffic - Te1/0/22] Graphs[kor-99-d35-esw1 - Traffic - Te1/0/22, kor-99-d35-esw1 - Traffic - Te1/0/22] Graphs[kor-99-d35-esw1 - Traffic - Te1/0/24, kor-99-d35-esw1 - Traffic - Te1/0/24] Graphs[kor-99-d35-esw1 - Traffic - Te1/0/24, kor-99-d35-esw1 - Traffic - Te1/0/24] Graphs[|host_description| - Traffic 30sec - |query_ifName| , |host_description| - Traffic 30sec - |query_ifName| ] Graphs[sea-9-120-asw5 - Traffic - Gi0/24, sea-9-120-asw5 - Traffic - Gi0/24] Graphs[sea-9-120-asw5 - Traffic - Gi0/24, sea-9-120-asw5 - Traffic - Gi0/24] Graphs[msa-50-531-rtr3 - Traffic - Gi0/2, msa-50-531-rtr3 - Traffic - Gi0/2] Graphs[msa-50-531-rtr3 - Traffic - Gi0/2, msa-50-531-rtr3 - Traffic - Gi0/2] Graphs[msa-50-531-rtr3 - Traffic - Gi0/0, msa-50-531-rtr3 - Traffic - Gi0/0] Graphs[msa-50-531-rtr3 - Traffic - Gi0/0, msa-50-531-rtr3 - Traffic - Gi0/0] Graphs[knt-18-61-asw4 - Traffic 30sec - Gi1/1 , knt-18-61-asw4 - Traffic 30sec - Gi1/1 ] DS[stl-73-307-rsw - Traffic 30sec - Te1/16, stl-73-307-rsw - Traffic 30sec - Te1/15, sbc-52-880-steven - Traffic 30sec - 10.48.11.7 - Vl11 , hbc-37-22-asw16 - Traffic - Gi1/1/4, hbc-37-22-asw16 - Traffic - Gi1/1/4, sea-9-53-rsw - Traffic 30sec - Te1/18, evt-40-40-asw2 - Traffic - |query_ifName|, evt-40-40-asw2 - Traffic - |query_ifName|, kor-99-d35-esw1 - Traffic - Te1/0/22, kor-99-d35-esw1 - Traffic - Te1/0/22, kor-99-d35-esw1 - Traffic - Te1/0/24, kor-99-d35-esw1 - Traffic - Te1/0/24, sea-9-53-rsw - Traffic 30sec - Te1/19, sea-9-120-asw5 - Traffic - Gi0/24, sea-9-120-asw5 - Traffic - Gi0/24, msa-50-531-rtr3 - Traffic - |query_ifIP| - Gi0/2, msa-50-531-rtr3 - Traffic - |query_ifIP| - Gi0/2, msa-50-531-rtr3 - Traffic - |query_ifIP| - Gi0/0, msa-50-531-rtr3 - Traffic - |query_ifIP| - Gi0/0, knt-18-61-asw4 - Traffic 30sec - Gi1/1]
2019/01/30 20:00:42 - SYSTEM THOLD STATS: Time:0.0253 Tholds:2 TotalDevices:388 DownDevices:42 NewDownDevices:0
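
For what it's worth, these are the checks and cleanup I've been trying against the poller output table (only a sketch of the commonly suggested steps, assuming the stock schema; I stop the poller cron job before clearing anything):

-- how many rows are stuck from previous runs?
SELECT COUNT(*) FROM poller_output;
-- with the poller stopped, clear the leftovers; the table is repopulated on the next cycle
TRUNCATE TABLE poller_output;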

eholz1 - not empty poller table!

Osiris
Cacti Pro User
Posts: 835
Joined: Mon Jan 05, 2015 10:10 am

Re: Cacti: Spine 1.1.38 unable to handle large disabled devi

#6 Post by Osiris » Sun Feb 03, 2019 7:37 am

There is a small issue with reindexing in 1.2.x. It looks like there will be a fix in 1.2.2, I think.
Before history, there was a paradise, now dust.
