An IRC discussion of Runnels semantics.

(17:17:14) laen: I have a dream of having my events be tagged, and having certain programs subscribe to certain tags. Tags like "availability" would be added to my "system startup" and "system shutdown" events, and the monitoring system would get them, along with the paging system.
(17:18:47) lkanies: laen: i was just thinking about the fact that we haven’t talked about semantics at all, and semantics are, um, important :)
(17:19:58) fishzle [n=fi@60-241-49-66.tpgi.com.au] entered the room.
(17:20:21) laen: Yeah, it was your "definitions" page that reminded me about tagging.
(17:20:32) lkanies: i like the phrase "runnels bus"
(17:21:40) lkanies: https://reductivelabs.com/projects/runnels
(17:21:45) lkanies: ok, what do people think?
(17:21:49) lkanies: i know it can be modified and stuff
(17:21:55) lkanies: how’s that for a first draught
(17:22:12) icblenke: tastes good.
(17:22:22) laen: less filling.
(17:22:53) lkanies: :)
(17:24:02) lkanies: i’ve preemptively registered #runnels on freenode
(17:24:16) lkanies: so i don’t have the problems i’m having with #puppet, where people are already on it and thus i can’t register it
(17:25:08) laen: Does this: "is that the routers contain the organizational metadata for the system, not the producers or consumers. " mean that you don’t see the producers being able to add metadata to the message? Or just that it won’t make any decisions based on that metadata?
(17:26:28) lkanies: they won’t contain metadata related to the flow of messages
(17:26:38) lkanies: i.e., a message cannot specify what its destination is
(17:27:22) laen: Ahh, okay.
(17:27:38) lkanies: i’ll clarify it
(17:28:15) laen: ..But they could add other arbitrary tags, like "sysup" to identify that it’s an event related to a system starting up?
(17:28:37) laen: Oh, there’s your section on tags.
(17:28:43) laen: I hadn’t seen it yet.
(17:28:46) icblenke: there is a potential problem there: what happens when a consumer isn’t running? are the messages lost?
(17:29:03) laen: I think they queue up.
(17:29:11) icblenke: do they?
(17:29:28) lkanies: ok
(17:29:35) icblenke: you might have multiple consumers. how do you know that all of the consumers are there ready to consume the messages you have to deliver?
(17:29:40) laen: In my head, I’m thinking of kind of a "del.icio.us" model, except with events.
(17:30:00) lkanies: icblenke: if a consumer has a subscription but is not up, then the messages should queue
(17:30:18) lkanies: obviously within reason; we don’t want 10^50 messages in the queue or whatever
(17:30:26) icblenke: oh, so subscriptions are persistent, even when the consumer isn’t "running"
(17:30:27) lkanies: laen: to some extent, yeah
(17:30:29) laen: So, you have producers, sending in events, adding tags. You have consumers/correlators, that subscribe to events and perhaps generate new events based on that..
(17:30:31) lkanies: i would think so
(17:30:43) icblenke: laen: right.
(17:30:49) lkanies: otherwise the router would have to retain a persistent connection to all consumers at all times
(17:31:13) icblenke: laen: the consumers subscribe with a pattern that matches messages they’re looking to consume.
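The queueing behaviour described here (persistent subscriptions, messages held "within reason" while a consumer is down, consumers matching on a pattern) can be sketched roughly as follows. This is only an illustration, not Runnels code: the `Router` class, its method names, the `max_queued` cap, and the choice of "pattern = set of required tags" are all assumptions made for the example.

```python
from collections import deque


class Router:
    """Toy router: subscriptions persist while consumers are offline."""

    def __init__(self, max_queued=1000):
        self.max_queued = max_queued   # "within reason" cap; the value is arbitrary
        self.subscriptions = {}        # consumer name -> set of required tags
        self.queues = {}               # consumer name -> bounded queue of messages

    def subscribe(self, consumer, tags):
        self.subscriptions[consumer] = set(tags)
        # deque(maxlen=...) drops the oldest message once the cap is reached
        self.queues.setdefault(consumer, deque(maxlen=self.max_queued))

    def publish(self, message):
        # The message never names a destination; only subscriptions decide.
        for consumer, wanted in self.subscriptions.items():
            if wanted <= set(message["tags"]):
                self.queues[consumer].append(message)

    def fetch(self, consumer):
        """Called when a consumer (re)connects; drains whatever queued up."""
        queue = self.queues[consumer]
        drained = list(queue)
        queue.clear()
        return drained


router = Router(max_queued=2)
router.subscribe("monitoring", {"availability"})
# The consumer is "down": nothing fetches while these are published.
router.publish({"tags": {"availability", "sysup"}, "body": "system startup"})
router.publish({"tags": {"metric"}, "body": "load_one: 0.15"})
router.publish({"tags": {"availability", "sysdown"}, "body": "system shutdown"})
queued = router.fetch("monitoring")
```

In this sketch the cap silently discards the oldest queued message; a real router could just as reasonably refuse new messages or expire by age instead.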
(17:31:17) laen: (I’m also kind of thinking of this in the XMPP pub/sub type thing)
(17:31:36) lkanies: and there are also things like downstream subscriptions: consumer A is subscribed to all "metric" events and is connected to router A; producer B produces metric events and is connected to router B
(17:31:47) laen: icblenke: You think so? I think there need to be correlators that add the appropriate tags for the consumers to subscribe to.
(17:31:55) lkanies: router B needs to know about consumer A’s subscription so that it forwards things correctly
(17:32:08) lkanies: laen: what do you mean?
(17:32:13) icblenke: laen: your "correlators" are nothing more than consumers/producers themselves.
(17:32:19) laen: icblenke: True!
(17:32:39) laen: They consume packets, add a tag, and stick them back in the queue.
(17:32:42) icblenke: right!
(17:32:51) icblenke: well, re-publish them.
(17:32:54) laen: Yeah.
(17:33:14) laen: (Without changing the UUID of the message)
(17:33:15) icblenke: they can also convert them to other message "types" to be forwarded along to whatever consumer subscribes to that message type
(17:33:34) lkanies: please provide an example of a post-producer tag
(17:33:35) icblenke: wouldn’t that be a different message at that point?
(17:33:37) icblenke: I’d think so.
(17:34:04) lkanies: it would result in consumers getting duplicate messages, those with tags and those without
(17:34:12) icblenke: I’m already thinking about ways to write this with reliable-msg
(17:34:13) lkanies: that is, they’d get the pre- and post-tag messages
(17:34:26) laen: Yeah, hrm.. I think messages need a UUID so that doesn’t happen.
(17:34:29) lkanies: why wouldn’t the producers just tag the messages?
(17:34:34) laen: I guess I want a special tagging consumer?
(17:34:38) lkanies: laen: yeah, definitely; i mention that
(17:34:44) laen: Well, because the producer doesn’t know everything about the message..
(17:34:49) laen: I want other parts of the process to be able to tag..
(17:34:57) icblenke: think pipes.
(17:35:00) lkanies: hmm
(17:35:13) icblenke: messages feed into producers and out of consumers in a message pipeline
(17:35:32) lkanies: so do we need a way to specify that a message is in a certain point in a pipeline? e.g., "these consumers only get tagged messages, and X consumer tags them"
(17:35:39) laen: Hrm.
(17:35:48) icblenke: a filter that adds a tag is just another piece in the pipeline.
(17:35:54) laen: Yeah, there we go. A filter.
(17:35:59) lkanies: we don’t really have a pipeline right now, tho
(17:36:22) icblenke: each "filter" just marks up the message enough to match the subscribe pattern of the next pipe
(17:36:27) laen: A message comes in, goes through some number of filters, and into the router, where subscriptions are processed and the message is distributed in a spoke pattern.
(17:36:44) icblenke: oh, no, you’re missing the idea
(17:36:45) laen: (And filters are just regexs that add tags?)
(17:36:46) lkanies: so we need the ability to stick filters between producers and routers?
(17:36:55) lkanies: i don’t like that
(17:37:03) laen: Yeah, I kinda don’t either.
(17:37:04) icblenke: I say that a filter is just another producer/consumer.
(17:37:10) lkanies: it implies 1-to-1 filter/producer ratios
(17:37:17) lkanies: icblenke: i agree, but i don’t see how it would work
(17:37:36) laen: (Do other people like the del.icio.us analogy?)
(17:37:45) icblenke: a "filter" could add tags that allow another subscribe pattern to match and produce a message to that end, continuing the pipeline.
(17:37:50) lkanies: laen: to some extent, but the uses are so different
(17:38:06) lkanies: and delicious hasn’t done a great job of really reusing all that metadata, and it requires that users add the data
(17:38:09) lkanies: there’s no filter there
(17:38:15) laen: Producers add data to the system, with some tags, and subscribers subscribe to certain tags..
(17:38:21) lkanies: icblenke: but where is the pipeline?
(17:38:39) laen: People (consumers) read del.icio.us, read a message, and go "oh!" and add some other tags to it.
(17:38:48) laen: Other people (consumers), subscribe to those tags.
(17:39:17) lkanies: hrm
(17:39:33) lkanies: that’s not how *i* use delicious ;)
(17:39:56) icblenke: the pipeline is how each "filter" adds a tag that allows another subscribe pattern to catch the output
(17:39:59) lkanies: i just add my own links and put the tags on there when i add new links
(17:40:00) laen: Some people (consumers) just watch everything that comes through, read, and tag.
(17:40:00) laen: Each event receives a UUID as it enters the system, and gets progressively more tags attached to it.
(17:40:16) icblenke: "color white" -> filter( "shape round") -> "color white; shape round" => filter(laces threaded) => "color white; shape round; laces threaded"
(17:40:31) lkanies: icblenke: but what about cases where a consumer catches a message both before and after tags, and it can understand both versions but prefers the extra data?
(17:40:33) icblenke: get where I’m going?
(17:40:59) icblenke: what would the router do?
(17:41:00) lkanies: yeah, but i don’t think you get a consistent system that way
(17:41:06) icblenke: ah.
(17:41:09) lkanies: i think the filters have to be inline
(17:41:23) laen: Hmm.. Yeah. I might be off here.
(17:41:26) icblenke: I don’t think so.
(17:41:32) lkanies: i think the router needs to have filters configured
(17:41:35) icblenke: I think you can keep consistency quite well…
(17:42:02) lkanies: and with your thing, all filter subscriptions have to specify that they are not interested in already processed messages
(17:42:10) icblenke: "step = one" -> filter( "step = 2" ) -> "step = two" -> filter( "step = three" ) => "step = 3"
(17:42:14) lkanies: so the messages need to be marked with what filters processed them
(17:42:21) icblenke: well, you get what I mean.
(17:42:28) laen: Well, they have UUIDs, and a filter would never get the same UUID twice.
(17:42:43) laen: The router would never deliver the same UUID to the same consumer more than once.
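A rough sketch of the two ideas in play here: a filter is just another consumer/producer that republishes a message with an extra tag under the same UUID, and the router never hands the same UUID to the same consumer twice. Everything named below (`DedupRouter`, `new_message`, the consumer names) is hypothetical and only meant to make the trade-off visible.

```python
import uuid


class DedupRouter:
    """Toy router that delivers each UUID to each consumer at most once."""

    def __init__(self):
        self.subscribers = []   # (consumer name, required tags, callback)
        self.seen = set()       # (consumer name, message uuid) already delivered

    def subscribe(self, name, tags, callback):
        self.subscribers.append((name, frozenset(tags), callback))

    def publish(self, message):
        for name, wanted, callback in self.subscribers:
            key = (name, message["uuid"])
            if wanted <= message["tags"] and key not in self.seen:
                self.seen.add(key)
                callback(message)


def new_message(body, tags):
    # The UUID is assigned once, when the event enters the system.
    return {"uuid": str(uuid.uuid4()), "body": body, "tags": frozenset(tags)}


router = DedupRouter()
received = {"log": [], "nagios": []}


def tag_cluster(message):
    # The "filter": consume, add a tag, republish under the same UUID.
    router.publish(dict(message, tags=message["tags"] | {"clustera"}))


router.subscribe("log", {"sysreboot"}, received["log"].append)
router.subscribe("tagger", {"sysreboot"}, tag_cluster)
router.subscribe("nagios", {"sysreboot", "clustera"}, received["nagios"].append)

original = new_message("node A1 booted", {"sysreboot"})
router.publish(original)
```

With this subscription order the "log" consumer sees only the pre-tag version and never the richer one, which is the consistency worry raised above; the tagger cannot loop forever because it never receives its own republished UUID.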
(17:42:51) icblenke: lkanies: I’m just trying to keep with the publish/subscribe model where the subscriptions are a pattern.
(17:43:14) lkanies: yeah, i understand and appreciate that, but it just seems like the filters have to be inline
(17:43:24) lkanies: or at least that there needs to be a special filter hook in routers
(17:43:27) icblenke: keeping the patterns from intersecting might be a problem if you’re not careful. things could explode into an infinite recursion.
(17:43:35) lkanies: yeah, exactly
(17:43:40) lkanies: i think filters must be one-time
(17:43:50) lkanies: the bus has one opportunity to filter a message
(17:44:01) lkanies: anything else is just too bloody complicated, if nothing else
(17:44:17) lkanies: and frankly, at this point i think we’re talking about implementation
(17:44:30) icblenke: well, almost.
(17:44:36) lkanies: "i use a shitty application that doesn’t automatically tag my messages, so i want a filter to do it for me"
(17:44:38) icblenke: it’s still high level here.
(17:44:56) lkanies: i still think semantic tagging starts with the producers
(17:45:01) icblenke: that really sounds like a plugin for that producer.
(17:45:09) icblenke: that doesn’t sound like a function of the bus.
(17:45:11) icblenke: to be honest.
(17:45:21) laen: I think so too, but I think a UUID should be able to receive more tags along the way..
(17:45:21) icblenke: the bus is about delivery.
(17:45:24) lkanies: either it’s the native application and it much better understands the messages, or it’s a special plugin built to understand the message
(17:45:38) lkanies: laen: please provide a concrete example of where that would be useful
(17:45:39) laen: icblenke: Doesn’t that imply distributing a bit more intelligence to the producers?
(17:45:47) lkanies: i don’t think so
(17:46:05) laen: Okay, let’s see…
(17:46:39) laen: I have cluster A and cluster B. They’re both monitored by different nagios hosts.
(17:47:26) laen: Cluster node A1 sends a message tagged "sysreboot", indicating that it just booted up. It gets put in the routing database.
(17:47:27) lkanies: we can come back to it later, but my puppet messages will all have the path to their source (e.g., /solaris/web server/apache/package=apache), and i’d like something like that to be a standard
(17:47:45) lkanies: "here’s the object that threw the message" and "here’s where that object fits into the configuration"
(17:47:55) icblenke: the bus should be as simple as possible.
(17:47:56) lkanies: kind of like an ability to provide a URI to get back to the message source
(17:48:02) lkanies: i definitely agree with that
(17:48:05) laen: I then have a producer that reads all "sysreboot" messages, and tags them with "clustera" or "clusterb". My nagios server subscribes to "clustera"+"sysreboot".
(17:48:22) lkanies: laen: if your clusters don’t know that they’re clusters, um, you’re fired :)
(17:48:49) laen: My reboot script doesn’t know that. All it knows is that the system was rebooted.
(17:48:59) lkanies: the whole point of semantic labelling is that you’ve already explained all this crap to your computers or they wouldn’t work
(17:49:55) lkanies: what reboot script? the one that initiated the reboot, or the one that got called upon reboot and produced the event, or the one that converted the event to a message?
(17:50:20) icblenke: the latter. the producer. that’s where you would tag, I would think.
(17:50:23) laen: The one that got called upon reboot and produced the event.
(17:50:30) laen: Hmm, okay.
(17:50:38) lkanies: i think this is where something like facter would come in: you’ve got a bunch of tags you want to associate with all outgoing messages; facter could be used to collect those tags
(17:50:47) laen: Ahh! True.
(17:51:00) icblenke: interesting.
(17:51:02) lkanies: just teach facter how to figure out the cluster name, and you’re done
(17:51:09) lkanies: cluster name, cluster type, etc.
(17:51:39) lkanies: and teach puppet about the functions that the cluster provides, and suddenly you’ve got tons of semantic labelling
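As a sketch of the producer-side tagging argued for here: the host collects what it already knows about itself and attaches it to every outgoing message, so a consumer can subscribe to "clustera" plus "sysreboot" without any intermediate tagger. The function `collect_host_tags` is a hard-coded stand-in for a facter-like tool, not facter's actual interface, and the `key=value` tag format is an assumption.

```python
def collect_host_tags():
    # Hypothetical stand-in for a facter-like fact collector;
    # the facts and values are hard-coded purely for illustration.
    return {"cluster": "clustera", "service_level": "prod", "os": "solaris"}


def make_message(body, event_tags, host_tags):
    # Host facts become key=value tags; event tags are added as-is.
    tags = {f"{key}={value}" for key, value in host_tags.items()}
    tags.update(event_tags)
    return {"body": body, "tags": tags}


def matches(subscription, message):
    # A subscription matches when every tag it asks for is present.
    return set(subscription) <= message["tags"]


reboot = make_message("system booted", {"sysreboot"}, collect_host_tags())
nagios_a = {"cluster=clustera", "sysreboot"}
nagios_b = {"cluster=clusterb", "sysreboot"}
```

The reboot script itself still only knows "the system was rebooted"; the cluster name comes from the shared fact collection.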
(17:52:17) laen: Hmm.. So in that world, the host knows everything about itself?
(17:52:26) lkanies: yep; craziness, huh?
(17:52:28) lkanies: :)
(17:53:09) lkanies: this is the whole point: build simple self-aware systems that share as much semantic information as they can and then allow other systems to pick out what they want based on that semantic data
(17:53:53) lkanies: you could define service levels on your network and have a special consumer that automatically pages people when any error messages come from hosts in the highest service levels
(17:53:58) lkanies: it would be silly, but you could do it :)
(17:54:15) lkanies: but the hosts should know their own service levels
(17:56:03) lkanies: ...no response to that?
(17:56:35) laen: (Sorry, people in my cubicle)
(17:56:50) lkanies: ah
(17:57:11) fishzle: consumers know the service levels
(17:57:15) fishzle: not providers
(17:57:43) lkanies: how would the consumers know?
(17:57:50) lkanies: you would have to explain it to them
(17:57:56) fishzle: how would the providers know?
(17:58:01) lkanies: but you’ve already explained it to the individual hosts, when you configured them
(17:58:12) fishzle: not the service levels
(17:58:13) mattimustang left the room (quit: "Trillian (http://www.ceruleanstudios.com").
(17:58:33) laen: What do you mean by "service levels" ?
(17:58:34) lkanies: puppet will produce configurations that vary based on service levels (prod, test, dev, etc.), so it will already have to know the service level
(17:58:43) laen: Oh, I see.
(17:58:52) fishzle: aah, I thought we were talking about the paging
(17:58:58) lkanies: or anything else you want: operating system, ip address, lan segment, data center
(17:59:03) lkanies: fishzle: i was mostly being silly there
(17:59:08) laen: If the host knows that it’s QA, is it really QA?
(17:59:19) laen: (Since that’s a difference between QA and production)
(17:59:25) icblenke: I’m thinking "service levels" as "this server has had 99.999% uptime" or "you have had 5million apache hits this month"
(17:59:48) fishzle: the provider doesn’t care about those
(17:59:53) fishzle: it’s just reporting about itself
(18:00:04) icblenke: the consumer would concern itself with such things.
(18:00:05) fishzle: it’s the other apps, the nagios, the openviews, that will have the thresholds set
(18:00:22) lkanies: icblenke: yeah, that’s a reporting thing more than anything else
(18:00:46) fishzle: I expect that if we have a bus, then nagios,rt, etc will be able to communicate to the host that it has exceeded a threshold (if the host doesn’t do it itself)
(18:00:55) lkanies: laen: well, if a host thinks it’s QA in puppet, then puppet will configure it to act like a QA host
(18:01:13) laen: Oh, in cases where QA is configured differently than production?
(18:01:14) lkanies: yeah, feedback loops will definitely sit on the bus
(18:01:19) lkanies: laen: exactly
(18:01:26) icblenke: laen: you might even have an SEC in the middle there for correlation.
(18:02:01) fishzle: SEC?
(18:02:04) icblenke: converting from syslog and back again to use SEC might be silly, but it’s an engine nonetheless.
(18:02:13) laen: icblenke: Yeah.. I hate the SEC language, but that’s part of what I’m getting at.. An event would be tagged differently, depending on other events around it (by a correlator consumer).
(18:02:21) laen: fishzle: "Simple Event Correlator"
(18:02:42) icblenke:
(18:03:53) icblenke: there’s an article in the latest Short Topics in System Administration.
(18:04:42) laen: It seems to me that there’ll be some information that the producer won’t have that a consumer will have.
(18:05:07) laen: s/will/could/
(18:05:28) icblenke: right, the intelligence of what to do with the messages.
(18:05:31) fishzle: very much so
(18:05:55) icblenke: a producer just needs to pass along what information about an event it can from whatever source it is gathering it.
(18:06:39) icblenke: I’m still concerned a bit with lost messages, but it sounds like we’re largely ignoring the sequence number as having any importance.
(18:07:13) fishzle: that’s a tradeoff in any environment/system
(18:07:23) fishzle: reliable transfer has costs
(18:07:57) icblenke: if a message isn’t received by a producer in a given timeout value (given a periodic event that should happen all the time), it’s up to the consumer to throw an alert message to the bus then?
(18:08:09) icblenke: sorry, if a message isn’t received by a consumer, rather.
(18:08:12) fishzle: besides, a "message not received" can be simulated with a sub/unsub/sub operation
(18:08:30) laen: Hey, maybe we should talk a bit about what our producers are, and what our consumers are, so we can think a bit about what information the consumers need.
(18:08:37) fishzle: that alert method is unpleasant
(18:08:47) icblenke: unpleasant, but necessary.
(18:08:58) fishzle: that presupposes that producers know exactly who the consumers are
(18:09:17) fishzle: so, there would need to be some sort of registration service as well
(18:09:23) lkanies: (on the phone but…) i’m not convinced that consumers will necessarily be able to use sequence numbers to determine messages are missing
(18:09:28) lkanies: how will a consumer ever know one is missing?
(18:09:38) laen: I don’t mean it that way.. I’m thinking "in our environments today, what do we want producing events, and what do we have that we want to receive those events."
(18:09:43) icblenke: that presupposes that consumers know exactly what type of message the producers should be providing on an ongoing periodic basis.
(18:09:51) fishzle: ugly
(18:09:53) icblenke: lkanies: without a sequence number, it won’t.
(18:10:05) fishzle: even with a sequence number it’s not a fix
(18:10:05) icblenke: lkanies: and re-sends aren’t possible in this model either.
(18:10:05) lkanies: even with a sequence number how could they?
(18:10:20) fishzle: the producer could send message1 about the clock update
(18:10:21) lkanies: unless they are guaranteed that they’re subscribed to every single message
(18:10:26) icblenke: unless the producer/consumer work out a message passing method to do that synchronization
(18:10:37) lkanies: i say: ignore all that for now
(18:10:48) laen: luke: I agree…
(18:10:50) icblenke: we can add that on afterward.
(18:11:06) fishzle: you could ask for an incrementing sequence number
(18:11:12) fishzle: that would prevent receiving old messages
(18:11:18) laen: I think ordered, reliable delivery is the responsibility of the transport..
(18:11:27) fishzle: that would be about the only thing you could guarantee
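The guarantee fishzle describes, that an incrementing sequence number at least keeps old messages out, might look like the sketch below. The `StaleDropper` name and the per-producer bookkeeping are assumptions for the example.

```python
class StaleDropper:
    """Accept a message only if its sequence number is newer than the last one seen."""

    def __init__(self):
        self.highest = {}   # producer name -> highest sequence number accepted

    def accept(self, producer, seq):
        if seq <= self.highest.get(producer, -1):
            return False    # old or duplicate message: drop it
        self.highest[producer] = seq
        return True


dropper = StaleDropper()
results = [
    dropper.accept("probe1", 1),
    dropper.accept("probe1", 3),   # gap: 2 was skipped, lost, or never subscribed to
    dropper.accept("probe1", 2),   # arrives late, rejected as old
    dropper.accept("probe2", 1),   # sequence numbers are tracked per producer
]
```

Note that the jump from 1 to 3 is accepted without complaint: as discussed above, a consumer that is not subscribed to every message from a producer cannot tell a lost message from one it was never meant to get.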
(18:11:59) icblenke: laen: and if the transport doesn’t deliver messages for some period of time… you’d need to alert on the consumer for that.
(18:12:11) icblenke: if the consumer knew it was supposed to receive something within a window of opportunity.
(18:12:37) fishzle: are we sure keepalives should be using this mechanism?
(18:12:45) icblenke: not keepalives.
(18:12:48) icblenke: I have netflow probes.
(18:12:56) icblenke: every minute, they rotate their logs.
(18:13:16) icblenke: I want to know when I haven’t received any log messages for a 5 minute period. that’s a consumer logic thing.
(18:13:25) fishzle: yes, but not a transport thing
(18:13:46) fishzle: you will need higher layer processing to keep timers
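The "higher layer processing to keep timers" could be as small as the sketch below: a consumer-side watchdog that notes when it last heard from a source and reports silence after a window (five minutes in icblenke's netflow case). The class name is hypothetical, and timestamps are passed in explicitly so the logic stays independent of any particular clock; in real use they would presumably come from something like `time.monotonic()`.

```python
class SilenceWatchdog:
    """Consumer-side timer: flags a source that has been quiet for too long."""

    def __init__(self, window_seconds, now):
        self.window = window_seconds
        self.last_seen = now

    def message_received(self, now):
        self.last_seen = now

    def is_silent(self, now):
        # True once more than `window` seconds have passed without a message
        return now - self.last_seen > self.window


watchdog = SilenceWatchdog(window_seconds=300, now=0)
watchdog.message_received(now=60)         # a log rotation message arrives
quiet_early = watchdog.is_silent(now=200)
quiet_late = watchdog.is_silent(now=361)  # 301 seconds since the last message
```

When `is_silent` turns true, the consumer would publish its own alert message to the bus; the transport itself stays unaware of the expectation.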
(18:13:48) icblenke: the transport is down. but it affects the consumer.
(18:13:48) laen starts a list of producers and consumers.
(18:13:52) icblenke: right.
(18:14:08) laen adds "syslog" (which is actually a lot of producers), and "netflow probes" to the list of producers.
(18:14:29) icblenke: add nagios alerts.
(18:14:39) fishzle: should store the list on madstop.com so we don’t lose sight of what the goals are
(18:14:42) icblenke: and SAR stats.
(18:15:00) laen: fishzle: Yeah, true.
(18:15:19) fishzle: I thought we weren’t going to use this to do stats?
(18:15:27) fishzle: and metrics
(18:15:40) icblenke: it’s a transport.
(18:15:42) laen: Well, that’s a type of message that could be sent..
(18:15:55) laen: "load_one: 0.15" is a valid message..
(18:15:55) fishzle: hmm
(18:16:21) icblenke: missing an event is just as important as missing some chunk of statistics.
(18:16:23) fishzle: I’m sure we didn’t want to build a general purpose message passing system
(18:16:24) laen: So is "sshd started"
(18:16:25) icblenke: usually, they’re correlated.
(18:17:00) fishzle: I understand we want info out, but
(18:17:29) fishzle: I’m thinking that some of these things will make it a very heavy protocol
(18:17:42) fishzle: with ACKs, NAKs, retransmits, etc
(18:17:48) fishzle: hmm
(18:19:00) laen: Yeah, that’s a concern.
(18:19:23) fishzle: "<laen> I think ordered, reliable delivery is the responsibility of the transport.."
(18:19:37) fishzle: is where you are
(18:19:47) fishzle: I think a more UDP like model would be simpler
(18:19:52) laen: (In my mind, I’m using XMPP as the transport protocol, until we decide on something else..)
(18:20:07) fishzle: other apps can build reliability or ordering as they like
(18:20:14) fishzle: brb, gotta go to the postoffice
(18:20:35) laen: (Others may have other ideas on what that transport looks like..)
(18:20:51) lkanies: laen: if you email me the list i’ll stick it up there
(18:21:58) lkanies: i’m about to head out and won’t be back until late, but keep it up :)
(18:24:05) icblenke: laen: yeah, I’m thinking XMPP too.
(18:24:59) lkanies: i’m cool with that
(18:28:42) laen: luke: Mailed.
(18:29:21) laen: For everyone else:
(18:29:24) laen: Producers:
(18:29:24) laen: Syslog (this is really a lot of producers, sshd, imapd, sendmail, etc..)
(18:29:24) laen: Netflow probes
(18:29:24) laen: Nagios alerts
(18:29:24) laen: SAR stats
(18:29:25) laen: Cron job execution status
(18:29:27) laen: Consumers:
(18:29:29) laen: Statistics systems (netflow receivers, sar data collectors, cacti)
(18:29:31) laen: Monitoring systems (nagios, openview, etc)
(18:29:33) laen: Security monitors (logsurfer)
(18:29:35) laen: Alerting system
(18:29:37) laen: Problem tracking system (RT, Trac, etc)
(18:32:25) lkanies: i just asked about doing WIPs on runnels and puppet at LISA
(18:33:53) lkanies: rather, i emailed the wips address about it
(18:40:04) laen envies the LISA-goers.
(18:40:27) lkanies: https://reductivelabs.com/projects/runnels/examples/view
(18:40:29) lkanies: ok, i’m off
(18:40:30) lkanies: ttyl
(18:40:33) laen: Bye!
(18:40:37) icblenke: ciao!
(18:48:59) laen: Okay, my "runnels as delicious for events" idea is bad on some other levels, as I envision it.
(18:50:11) laen: For example: It means the queue would grow indefinitely, without some sort of expiration process.
(18:50:35) laen: On the other hand, I’d really like to have a database of all events that I’ve passed.
(18:51:45) laen: (Well, not all but "some time period’s worth")
(18:52:32) fishzle left the room (quit: Read error: 104 (Connection reset by peer)).
(18:52:34) laen: ..And it would be neat to be able to write little scripts that subscribe to some combination of tags, and get the last 50 events that match, or somesuch..
(19:02:28) laen -> home.
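One way to picture laen's last two points together, a history that does not grow forever plus a "last N events matching these tags" query, is the sketch below. `EventHistory`, its size cap, and the event layout are invented for the example; expiry by count is just one possible expiration process (expiry by age would work as well).

```python
from collections import deque


class EventHistory:
    """Bounded store of recent events, queryable by a combination of tags."""

    def __init__(self, max_events=10000):
        # Oldest events fall off the front once the cap is reached.
        self.events = deque(maxlen=max_events)

    def record(self, event):
        self.events.append(event)

    def last_matching(self, tags, limit=50):
        wanted = set(tags)
        matches = []
        # Walk from newest to oldest and stop once enough events are found.
        for event in reversed(self.events):
            if wanted <= set(event["tags"]):
                matches.append(event)
                if len(matches) == limit:
                    break
        matches.reverse()   # return in chronological order
        return matches


history = EventHistory(max_events=100)
for i in range(120):
    tags = {"sysreboot", "clustera"} if i % 2 == 0 else {"metric"}
    history.record({"id": i, "tags": tags})

recent = history.last_matching({"sysreboot", "clustera"}, limit=3)
```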
(21:37:02) fishzle [n=fi@60-241-49-66.tpgi.com.au] entered the room.
(21:38:09) fishzle:
(21:38:17) fishzle: here’s some other logging system
(21:38:26) fishzle: older, but with analysis
(21:46:15) fishzle: actually, it’s too old
(00:09:10) lkanies: https://reductivelabs.com/projects/runnels/model
(00:11:44) lkanies: and now, off to bed

$Id: semantics.page 6 2006-07-02 20:05:14Z luke $*