New Cacti Architecture (0.8.8) - RFC Response Location

TheWitness
Developer
Posts: 14804
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA

#1 Post by TheWitness » Sun Dec 16, 2007 2:01 pm

All,

Please submit your RFC comments here. Thanks for your participation. I will attach newer versions of the RFC and provide feedback in this post.

Regards,

TheWitness
Attachments
Cacti Multiple Poller Design v1.0.pdf
(338.71 KiB) Downloaded 5618 times
Last edited by TheWitness on Mon Feb 09, 2009 10:34 pm, edited 2 times in total.
True understanding begins only when we realize how little we truly understand...

Life is an adventure, let yours begin with Cacti!

Author of MacTrack, Boost, CLog, SpikeKill, Platform RTM, DSStats, maintainer of Spine, lots of unpublished work and most of Cacti's bugs.
_________________
Official Cacti Documentation
GitHub Repository with Supported Plugins
Central Plugin Repository
Central Templates Repository


I'm still out there people. Getting excited for Cacti 1.2. I think it will be a great release.

gandalf
Developer
Posts: 22375
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany

#2 Post by gandalf » Sun Dec 16, 2007 3:11 pm

Hi Larry,

Thank you for opening this long overdue discussion. Besides "automation" (yes, I know ...), this was the most deeply discussed topic of 2.CCC.eu.

But personally, I have a problem with this picture. It shows very nicely where everything happens from the data point of view. (BTW: I suppose it would be better to use either poller_item _or_ poller_cache in picture and text. Personally, I prefer poller_item as this is the name of the real table.)

But I am not sure I fully understand where the application logic lies; in other words: what about the "workflow"?

I suppose that each Poller Group is governed by a local crontab (or a "real" daemon) that fetches data from the db server (either directly or via table replication). Output is stored in the local poller_output table and replicated to the db server? What would be a criteria to associate a host/data source to some specific Poller Group? I suppose, that clock synchronization (ntp?) would be required, as the local time used for poller_output would be used for rrdtool update. What about timezones? Where should hooks like "poller_bottom" be executed?

And another part of the logic would lie with the http servers, for sure (console aka administration). But where lies the rrdtool update logic? With the database server? Or with the RRDTool servers? Would RRDfile Update Groups equal Poller Groups? If not, what would be the criteria this time?

The rrdfile data pipeline surely would be used for graphing. But a lot of plugins currently require access to rrd files as well. So there would be more than graphing only. When running from load-balanced http instances, either graph caching will fail or cache must reside on RRDfile Servers. As there are two different pipes to RRDfile Servers, perhaps synchronization between updates and graph (rrdtool fetch) is necessary.

Those are my first thoughts. Surely more will follow.

Reinhard

TheWitness
Developer
Posts: 14804
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA

#3 Post by TheWitness » Sun Dec 16, 2007 3:54 pm

gandalf wrote:Thank you for opening this long overdue discussion. Besides "automation" (yes, I know ...), this was the most deeply discussed topic of 2.CCC.eu.
Yes, long overdue.
gandalf wrote:But personally, I have a problem with this picture.
RFCs often start that way.
gandalf wrote:It shows very nicely where everything happens from the data point of view. (BTW: I suppose it would be better to use either poller_item _or_ poller_cache in picture and text. Personally, I prefer poller_item as this is the name of the real table.)
I literally "slapped" this together. I will correct that in v2.
gandalf wrote:But I am not sure I fully understand where the application logic lies; in other words: what about the "workflow"?
Yes, after sending it out, I realized I left that out. Basically, the poller will, by default, use the main server's poller_item table for its list of poller items. If for some reason the main server is not reachable, it will use its local copy and store the poller_output table locally.

The same is intended for the poller_output table: central server first. The remote pollers will be given instructions to "update/synchronize" their local poller_item table periodically (that is, when things change). Those synchronizations would not happen any more often than every 5 minutes.

If the central server is not available, each poller will cache the updates in its local poller_output table until the remote connection is available again, then dump them sequentially, by date, to the central server. By doing so, no data will be lost.
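
The fallback described above (central server first, local cache when it is unreachable, then an ordered dump) can be sketched roughly as follows. This is only an illustration, not the RFC's design: `RemotePoller`, `Sample`, `send_to_central`, and the in-memory list standing in for the local poller_output table are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Sample:
    timestamp: int       # epoch seconds when the value was polled
    local_data_id: int   # hypothetical data source identifier
    value: str


class RemotePoller:
    """Hypothetical remote poller that buffers output while the central server is down."""

    def __init__(self, send_to_central):
        # send_to_central is an assumed callable that raises ConnectionError
        # when the central server cannot be reached.
        self.send_to_central = send_to_central
        self.local_output = []  # stands in for the local poller_output table

    def record(self, sample):
        """Store a new sample locally, then try to push everything upstream."""
        self.local_output.append(sample)
        return self.flush()

    def flush(self):
        """Send buffered samples oldest first; stop at the first failure."""
        self.local_output.sort()  # orders by timestamp first
        while self.local_output:
            try:
                self.send_to_central(self.local_output[0])
            except ConnectionError:
                return False  # keep the remaining rows for the next attempt
            self.local_output.pop(0)
        return True
```

A row is removed from the local buffer only after the send succeeds, which is what keeps data from being lost during an outage in this sketch.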
gandalf wrote:What would be a criteria to associate a host/data source to some specific Poller Group?
There would have to be modifications to the poller_output table, or another table to keep track of when it is time to poll things. That is more TBD, until I have more feedback.
gandalf wrote:I suppose, that clock synchronization (ntp?) would be required, as the local time used for poller_output would be used for rrdtool update.
Of course...
gandalf wrote:What about timezones?
You tell me... :)
gandalf wrote:Where should hooks like "poller_bottom" be executed?
I need feedback from people like Howie and yourself to determine "where" that should be. So, see my comment above. It's an RFC you know ;)
gandalf wrote:And another part of the logic would lie with the http servers, for sure (console aka administration). But where lies the rrdtool update logic? With the database server? Or with the RRDTool servers?
The RRDfile Services will be asynchronous and running as daemons. They will process all items in the poller_output table as they come in and handle other requests in other threads. The main poller_output table, with some minor modifications, will be used to achieve RRDupdates.
gandalf wrote:Would RRDfile Update Groups equal Poller Groups? If not, what would be the criteria this time?
No.
gandalf wrote:The rrdfile data pipeline surely would be used for graphing. But a lot of plugins currently require access to rrd files as well.
Need feedback from Plugin developers as to "how" they would like this to work.
gandalf wrote:So there would be more than graphing only. When running from load-balanced http instances, either graph caching will fail or cache must reside on RRDfile Servers.
I expect the RRDtool Services to handle this. Each graph will know, in advance, which server it needs to talk to.

Regards,

TheWitness

gandalf
Developer
Posts: 22375
Joined: Thu Dec 02, 2004 2:46 am
Location: Muenster, Germany

#4 Post by gandalf » Sun Dec 16, 2007 4:05 pm

TheWitness wrote:
gandalf wrote:It shows very nicely where everything happens from the data point of view. (BTW: I suppose it would be better to use either poller_item _or_ poller_cache in picture and text. Personally, I prefer poller_item as this is the name of the real table.)
I literally "slapped" this together. I will correct that in v2.
I suppose separating data logic and workflow is the better way. Otherwise I fear that the picture will become too crowded.

From the current design, the database servers seem to define the limit of this architecture. While Poller and RRDfile Servers are scalable, as well as http, the database server exists only once. As I understand it, the second one is for failover only.
So, if there's a central poller_item table as well as poller_output, their update/delete performance will be crucial. I suppose you're thinking of memory tables as boost uses them. And then, like boost does with rrdtool bulk update, there's the SQL bulk insert that will gain some more performance, correct?

Reinhard

TheWitness
Developer
Posts: 14804
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA

#5 Post by TheWitness » Sun Dec 16, 2007 4:31 pm

Yes, memory tables have an I/O rate in excess of 40k updates per second, so even though they use a table lock mechanism, we are safe. I was thinking, though, that making this a separate database altogether would help other subsystems' performance and simplify backup.

What do you think about that?

TheWitness

melchandra
Cacti User
Posts: 311
Joined: Tue Jun 29, 2004 12:52 pm
Location: Indiana

#6 Post by melchandra » Sun Dec 16, 2007 4:33 pm

Forgive me if this seems to be a silly question.

Is there some reason why the pollers don't directly update the RRD Files? Does it have to go through the database?

Is there a way for the Pollers to be sent both the host and oid information to query, as well as host information for the Remote RRD Update Service so they could query, and then pass the data directly to the RRD storage devices?
Dave

Howie
Cacti Guru User
Posts: 5330
Joined: Thu Sep 16, 2004 5:53 am
Location: United Kingdom

#7 Post by Howie » Sun Dec 16, 2007 4:33 pm

gandalf wrote:The rrdfile data pipeline surely would be used for graphing. But a lot of plugins currently require access to rrd files as well. So there would be more than graphing only. When running from load-balanced http instances, either graph caching will fail or cache must reside on RRDfile Servers. As there are two different pipes to RRDfile Servers, perhaps synchronization between updates and graph (rrdtool fetch) is necessary.
Once updates have been made scalable, is there a requirement for actual load-balancing of HTTP frontends? I can see why you might want HA, but does anyone really have so many concurrent users that a single server can't cope? All I ever see are queries about the polling limitations...

With a HA setup instead, then cache-sharing isn't really necessary. Really, if you have 1000 users all hitting the same graph then you are still reducing the number of graph-drawing operations from 1000 to 2, which is probably enough. If they aren't all hitting the same graphs, then graph caching isn't going to help anyway.
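
The arithmetic in the paragraph above can be checked with a small simulation. This is a sketch under stated assumptions: `FrontEnd` and `simulate` are hypothetical names, each front end keeps its own unshared cache, requests are spread round-robin, and cache expiry is ignored.

```python
class FrontEnd:
    """Hypothetical HTTP front end with its own, unshared graph cache."""

    def __init__(self):
        self.cache = {}
        self.renders = 0  # number of graph-drawing operations performed

    def get_graph(self, graph_id):
        if graph_id not in self.cache:
            self.renders += 1  # cache miss: the graph has to be drawn
            self.cache[graph_id] = f"png-for-{graph_id}"
        return self.cache[graph_id]


def simulate(requests, nodes):
    """Spread requests round-robin over the nodes and count total renders."""
    for i, graph_id in enumerate(requests):
        nodes[i % len(nodes)].get_graph(graph_id)
    return sum(node.renders for node in nodes)


# 1000 requests for the same graph across two nodes: one render per node.
same_graph = simulate([42] * 1000, [FrontEnd(), FrontEnd()])

# 1000 requests for 1000 different graphs: caching does not help at all.
all_different = simulate(range(1000), [FrontEnd(), FrontEnd()])
```

Under these assumptions `same_graph` is 2 and `all_different` is 1000, matching both halves of the argument.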

(I don't hit any of these limitations with my own modest needs, so I'm just curious really. Despite what it says under my name on the left, I'm just a lowly user that talks a lot :-) )
Weathermap 0.98 is out! & QuickTree 1.0. Superlinks is over there now (and built-in to Cacti 1.x).
Some Other Cacti tweaks, including strip-graphs, icons and snmp/netflow stuff.
(Let me know if you have UK DevOps or Network Ops opportunities, too!)

TheWitness
Developer
Posts: 14804
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA

#8 Post by TheWitness » Sun Dec 16, 2007 4:40 pm

Howie wrote:With a HA setup instead, then cache-sharing isn't really necessary. Really, if you have 1000 users all hitting the same graph then you are still reducing the number of graph-drawing operations from 1000 to 2, which is probably enough. If they aren't all hitting the same graphs, then graph caching isn't going to help anyway.
This sort of answers the "should we invent it ourselves" question on load balancing. HA for sure, but there are already technologies for that. So, I suspect your answer would be to leave that out of the design. The capability will be there for HA, but it's always optional.

TheWitness

TheWitness
Developer
Posts: 14804
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA

#9 Post by TheWitness » Sun Dec 16, 2007 4:41 pm

melchandra wrote:Forgive me if this seems to be a silly question.

Is there some reason why the pollers don't directly update the RRD Files? Does it have to go through the database?

Is there a way for the Pollers to be sent both the host and oid information to query, as well as host information for the Remote RRD Update Service so they could query, and then pass the data directly to the RRD storage devices?
Yes, absolutely. When you have lots of them, the disk I/O required is astounding. So, by batching them you can reduce I/O wait by 80-90% over time. The database provides that batching for us.
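
As a rough illustration of the batching idea: rows queued in the database can be grouped per RRD file so that each file is touched once per batch instead of once per sample. The function name `build_batched_updates` and the `(rrd_path, timestamp, value)` row shape are assumptions made for this sketch; it only builds `rrdtool update` argument lists (which accept several `timestamp:value` pairs per call) and does not execute anything.

```python
from collections import defaultdict


def build_batched_updates(rows):
    """Group (rrd_path, timestamp, value) rows into one update command per file."""
    per_file = defaultdict(list)
    for rrd_path, timestamp, value in rows:
        per_file[rrd_path].append((timestamp, value))

    commands = []
    for rrd_path in sorted(per_file):
        # rrdtool expects samples for a file in increasing time order.
        samples = sorted(per_file[rrd_path], key=lambda sample: sample[0])
        args = [f"{ts}:{val}" for ts, val in samples]
        commands.append(["rrdtool", "update", rrd_path, *args])
    return commands
```

Three queued rows for two files therefore become two commands rather than three separate file updates; the saving grows with the number of polling cycles held in the queue.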

TheWitness

ben_c
Cacti User
Posts: 203
Joined: Mon May 14, 2007 8:12 pm
Location: Melbourne, Australia.

#10 Post by ben_c » Sun Dec 16, 2007 4:47 pm

Great document guys, need a little bit more time to analyze it.

But it is the direction Cacti needs to start heading. I know the current limitations all too well from using it in a large enterprise (50,000+ data sources).

Howie
Cacti Guru User
Posts: 5330
Joined: Thu Sep 16, 2004 5:53 am
Location: United Kingdom

#11 Post by Howie » Sun Dec 16, 2007 4:49 pm

TheWitness wrote:This sort of answers the "should we invent it ourselves" question on load balancing. HA for sure, but there are already technologies for that. So, I suspect your answer would be to leave that out of the design. The capability will be there for HA, but it's always optional.
Indeed. I'd say stick to the architectural stuff required to support it (or at least not break it :-) ). HA solutions are usually either platform-specific (CARP, MS NLB, ultramonkey) or external (CSS, Alteon etc) anyway.

ben_c
Cacti User
Posts: 203
Joined: Mon May 14, 2007 8:12 pm
Location: Melbourne, Australia.

#12 Post by ben_c » Sun Dec 16, 2007 4:49 pm

TheWitness wrote:
Howie wrote:With a HA setup instead, then cache-sharing isn't really necessary. Really, if you have 1000 users all hitting the same graph then you are still reducing the number of graph-drawing operations from 1000 to 2, which is probably enough. If they aren't all hitting the same graphs, then graph caching isn't going to help anyway.
This sort of answers the "should we invent it ourselves" question on load balancing. HA for sure, but there are already technologies for that. So, I suspect your answer would be to leave that out of the design. The capability will be there for HA, but it's always optional.

TheWitness
I tend to agree; built-in application-level load balancing is never an ideal situation, in my opinion. If you really need it, buy some hardware to do the job properly (F5, Alteon etc).

Howie
Cacti Guru User
Posts: 5330
Joined: Thu Sep 16, 2004 5:53 am
Location: United Kingdom

#13 Post by Howie » Sun Dec 16, 2007 4:56 pm

TheWitness wrote:
gandalf wrote:Where should hooks like "poller_bottom" be executed?
I need feedback from people like Howie and yourself to determine "where" that should be. So, see my comment above. It's an RFC you know ;)
If I understand the layout correctly, it would have to be on the 'master' poller, as the remote pollers report back in to that one. Of the 'big' plugins that I can think of that actually use cacti data * (reportit, thold, weathermap), thold and wmap can both already use poller_output instead of looking at rrd files directly. I don't really see any way around it for reportit though... if the rrdfiles were physically distributed onto different machines, then there would need to be some sort of aggregate-view or 'run this on all the rrd servers' API.

Actually thold probably would work either on the local pollers or the central location, since it works with one DS at a time, but it would be easier to maintain in the centre.
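
The "run this on all the rrd servers" idea mentioned above might look something like the sketch below. Everything here is hypothetical: `run_on_all`, `peak_per_ds`, and the dictionaries standing in for RRD servers and their data are placeholders, and a real version would need a network API that the RFC does not define.

```python
def run_on_all(servers, query):
    """Run query on every RRD server and merge the results into one view.

    servers maps a server name to that server's data (data source -> values).
    query is a callable taking one server's data and returning per-DS results.
    """
    merged = {}
    for name, rrd_files in servers.items():
        for ds, result in query(rrd_files).items():
            # Remember which server answered, since each DS lives on one server.
            merged[ds] = {"server": name, "result": result}
    return merged


def peak_per_ds(rrd_files):
    """Example query: the maximum value seen for each data source."""
    return {ds: max(values) for ds, values in rrd_files.items()}
```

A reporting plugin could then call `run_on_all(servers, peak_per_ds)` once and get an aggregate view without knowing where each file is stored.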

* I don't use Manage, MACTrack or Discovery, but as far as I know they don't deal with rrd data, do they?

Alice
Cacti User
Posts: 107
Joined: Tue Oct 28, 2003 4:54 pm
Location: Bucharest, RO.

#14 Post by Alice » Sun Dec 16, 2007 8:21 pm

It's a little late, even for me, so please excuse all the aberrations I'm about to write :)

WHY do we want multiple pollers?

a) Backup - have all the data available, even if one of the Servers is not available, or can't request data from all the devices for some reason (network outage)
b) Speed - One server can't handle all the required devices
c) Combined (a+b)

IMHO it's a different approach for every one of the three situations listed.
I THINK that speed can be solved "local" (4 separate servers: Web, poller, DB and RRD Updater)

Facts: 10676 DS, 9742 graphs, 10669 RRD files totalling 1.3GB
Polling time: ~15 seconds
Two servers, one DB, one for the rest

Backup, on the other hand, is something else.

What if we run 2 somehow completely "independent" cacti instances, both querying the same hosts, and somehow synchronize them?
Something like replication for MySQL (I really don't know how this works) and rsync for RRD files?

Hmm, nice shit :D
Error in posting

DEBUG MODE

SQL Error : 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'WHERE forum_id = 7' at line 3

UPDATE forums SET forum_posts = forum_posts + 1, forum_last_post_id = WHERE forum_id = 7

Line : 423
File : functions_post.php
http://www.x-graphs.com X-graphs :: All kind of graphs

TheWitness
Developer
Posts: 14804
Joined: Tue May 14, 2002 5:08 pm
Location: MI, USA

#15 Post by TheWitness » Sun Dec 16, 2007 10:37 pm

Yea, you get that when you sit on a post too long. Back button, copy, back button, repost, paste, post.

TheWitness