Time:63.1088 Method:spine Processes:16 Threads:32 Hosts:8189 HostsPerProcess:512 DataSources:3458700 RRDsProcessed:1069607 SystemLoad5mAvg:33.8
Ubuntu 18.04 LTS (GNU/Linux 4.15.0-20-generic x86_64) - Vanilla kernel
MariaDB 10.1
Dual Xeon Gold 6152, 22 cores @ 2.1GHz per CPU
192GB DDR4 2666MHz RAM
2 x 240GB SSDs - 12Gb/s SAS - RAID 1 (database and OS)
2 x Samsung PM1725a 800GB M.2 NVMe drives (RRD storage only) - software RAID 0 using mdadm (RRDs are backed up nightly) - XFS filesystem mounted with noatime.
Cacti Version 1.1.38
Cacti OS unix
NET-SNMP Version 5.7.3
RRDtool Version 1.7.x
Devices 8245
Graphs 1084712
Data Sources
Script/Command: 46
SNMP Get: 8257
SNMP Query: 1068248
Script Query: 4
Script Server: 8055
Script Query - Script Server: 110
Total: 1084720
Interval 300
Type SPINE 1.1.35 Copyright 2004-2017 by The Cacti Group
Items
Action[0]: 3450521
Action[1]: 54
Action[2]: 8125
Total: 3458700
Concurrent Processes 16
Max Threads 32
PHP Servers 10
Script Timeout 60
Max OID 60
MemTotal 192.07 GB
MemFree 3.45 GB
Buffers 732.58 MB
Cached 148.83 GB
Active 138.21 GB
Inactive 39.75 GB
SwapTotal 2.05 GB
SwapFree 2.03 GB
This system is not using boost - with NVMe there is no need, and it would probably slow the system down. However, in order to make full use of the available system resources I have had to modify the batch size that poller.php fetches from the poller_output table, increasing $max_rows from 40k to 1.5 million (you will need to raise or disable the PHP memory limits in the scripts to do this). I have also replaced the path to the RRDtool binary with a Python script that splits each batch into equal chunks and spawns multiple RRDtool processes, because a single RRDtool process was unable to saturate the write speed of the NVMe drives.
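A minimal sketch of that wrapper idea is below. This is illustrative rather than my exact script: it assumes poller.php has been pointed at the wrapper in place of the rrdtool binary and pipes one command per line on stdin (as it would to "rrdtool -"), and the binary path and worker count are placeholders.

#!/usr/bin/env python3
# Illustrative fan-out wrapper: read the batch of rrdtool commands that
# poller.php pipes in, split it into equal chunks, and feed each chunk
# to its own rrdtool process. Path and worker count are placeholders.
import subprocess
import sys
import threading

RRDTOOL = "/usr/bin/rrdtool.real"   # the real binary, moved aside
WORKERS = 8                         # parallel rrdtool processes

def feed(proc, chunk):
    # Feed this worker its share of the batch, then close its stdin so
    # the rrdtool process exits once it has drained its commands.
    proc.stdin.write("\n".join(chunk) + "\n")
    proc.stdin.close()

def main():
    # poller.php closes its end of the pipe when the batch is written,
    # so this read completes once the whole batch has been handed over.
    commands = sys.stdin.read().splitlines()
    # Deal the commands out into WORKERS roughly equal chunks.
    chunks = [commands[i::WORKERS] for i in range(WORKERS)]
    procs, feeders = [], []
    for chunk in chunks:
        if not chunk:
            continue
        # "rrdtool -" reads commands from stdin, one per line; its
        # "OK ..." responses pass straight through to our stdout.
        proc = subprocess.Popen([RRDTOOL, "-"], stdin=subprocess.PIPE,
                                text=True)
        t = threading.Thread(target=feed, args=(proc, chunk))
        t.start()
        procs.append(proc)
        feeders.append(t)
    # Only exit once every worker has been fed and has finished, so the
    # caller sees the whole batch as complete.
    for t in feeders:
        t.join()
    for proc in procs:
        proc.wait()

if __name__ == "__main__":
    main()

The exact way the batch is split does not matter much; the point is simply to have several rrdtool processes issuing writes concurrently so the NVMe queue depth stays high, which a single process cannot manage.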
MariaDB also took considerable tuning at this scale, both to stop spine from getting 2013 'Lost connection' errors and to cope with the concurrency. After tuning it in conjunction with spine, it is now rock solid (config below).
The run time for spine itself is about 20-25 seconds; the remainder of the total polling time is poller.php working its way through the entries in the poller_output table. It would be good if Cacti had parallelization built into this part of the system: spine is highly parallel when collecting the data, but poller.php is single-threaded and drives a single RRDtool instance, which bottlenecks the poller and makes poor use of modern multi-core hardware and SSDs.
After my modifications, I would suggest this system could probably take twice the number of RRDs and still produce good polling times (with additional RAM). RAM size is critical to performance because it holds the hot portions of the RRDs in the disk cache; without sufficient RAM the system would evict or swap those pages out and poll times would suffer. The vmtouch output below shows how much of the RRD tree is currently resident in the page cache:
# vmtouch /nvme/rra
Files: 1083217
Directories: 1
Resident Pages: 37501349/105650472 143G/403G 35.5%
Elapsed: 32.506 seconds
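As a rough sanity check on the "twice the number of RRDs" estimate above: about 143G of page cache is currently keeping 35.5% of a 403G RRD tree resident, so doubling the tree to roughly 806G at the same hot fraction would want on the order of 286G resident - hence the "with additional RAM" caveat.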
MariaDB Configuration
#
# * Fine Tuning
#
key_buffer_size = 256M
max_allowed_packet = 1G
net_read_timeout = 600
net_write_timeout = 180
wait_timeout = 86400
interactive_timeout = 86400
join_buffer_size = 512
max_heap_table_size = 8G
tmp_table_size = 8G
net_retry_count = 20
# Thread Pool Configuration
thread_handling = pool-of-threads
thread_pool_idle_timeout = 250
thread_pool_max_threads = 1500
thread_pool_size = 88
thread_concurrency = 44
thread_stack = 192K
# Back Log increases wait time(ms) in queue for clients connecting.
back_log = 3000
# This replaces the startup script and checks MyISAM tables if needed
# the first time they are touched
myisam_recover_options = BACKUP
max_connections = 5000
max_connect_errors = 10000
table_open_cache = 8128
#
# * Query Cache Configuration
#
query_cache_limit = 8M
query_cache_size = 256M
query_cache_type = 1
#
# * InnoDB
#
# InnoDB is enabled by default with a 10MB datafile in /var/lib/mysql/.
# Read the manual for more InnoDB related options. There are many!
innodb_doublewrite = OFF
innodb_flush_neighbors = 0
innodb_buffer_pool_size = 30G
innodb_buffer_pool_instances = 30
innodb_log_file_size = 3G
innodb_additional_mem_pool_size = 80M
innodb_flush_log_at_timeout = 3
innodb_read_io_threads = 64
innodb_write_io_threads = 64
innodb_log_buffer_size = 16M
Additional system settings:
/etc/sysctl.conf
vm.swappiness = 1
net.ipv4.tcp_max_syn_backlog = 8192
The ulimit open-file (nofile) limits have been increased as well (example below).
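One way to do this is in /etc/security/limits.conf; the values and user names here are illustrative, not necessarily the exact ones from this box:

/etc/security/limits.conf
mysql soft nofile 65535
mysql hard nofile 65535
www-data soft nofile 65535
www-data hard nofile 65535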
I hope this helps people make informed decisions and scale their systems effectively. I noticed a lack of documentation online from people with really big installations, so I thought I would share my findings.