Puppet: System Administration Automated

Support

Ticket #1131 (closed defect: fixed)

Opened 8 months ago

Last modified 8 months ago

puppetmaster memory leak

Reported by: descala Assigned to: adamhjk
Priority: normal Milestone: 0.24.4
Component: library Version: 0.24.2
Severity: normal Keywords:
Cc: Triage Stage: Accepted
Attached Patches: None Complexity: Unknown

Description

afert upgrading from 0.23.2 to 0.24.2 puppetmaster eats up server memory.

it logs

Puppet (err): Could not call: Cannot allocate memory - fork(2)

until restarted or crash.

Attachments

puppetmaster-memory-week.png (62.1 kB) - added by descala on 03/12/08 17:57:18.
memory-month-puppet.png (62.2 kB) - added by adamhjk on 03/14/08 01:42:10.
A memory graph showing the puppet master growth, leading up to #1131
diff (327 bytes) - added by descala on 03/14/08 19:53:34.
lib/puppet/util/loadedfile.rb
memory_profiler.rb (2.8 kB) - added by luke on 03/17/08 00:06:23.
A library you can load to profile memory growth somewhat.

Change History

03/12/08 17:57:18 changed by descala

  • attachment puppetmaster-memory-week.png added.

03/13/08 15:15:44 changed by luke

  • stage changed from Unreviewed to Accepted.
  • milestone set to 0.24.3.

Is there any more information you can give about your environment? Lots of file serving? None? How often does the memory spike happen? How quickly?

03/13/08 21:00:41 changed by descala

  • server manages 40 nodes
  • nodes located in other networks, behind NAT
  • apache2 in front of mongrel+puppet
  • serving only some configuration files. not lots of file serving.
  • brings Fedora 5 on 2GB RAM CPU x4 Xeon to its knees after 4 / 5 days. we are now restarting it daily
  • rubyversion 1.8.6, puppetversion 0.24.2
  • use of external_nodes
  • external_nodes script calls wget which retrives node info from rails app

May this be related to #1130 ?

(follow-up: ↓ 4 ) 03/13/08 21:14:29 changed by Fujin

Possibly related to http://reductivelabs.com/trac/puppet/ticket/1010 - only one of the 'side effects' of memory leaking was fixed, not the cause;

(in reply to: ↑ 3 ; follow-up: ↓ 5 ) 03/13/08 22:11:11 changed by descala

yes. it also replaced some files with empty directories in one of its crisis.

(in reply to: ↑ 4 ; follow-up: ↓ 6 ) 03/13/08 23:26:21 changed by adamhjk

Replying to descala:

yes. it also replaced some files with empty directories in one of its crisis.

Did it do that under 0.24.2? If, so, we have a larger problem on our hands, since that means the work-around that caused #1010 to be closed doesn't work.

(in reply to: ↑ 5 ) 03/14/08 00:15:04 changed by descala

Replying to adamhjk:

Replying to descala:

yes. it also replaced some files with empty directories in one of its crisis.

Did it do that under 0.24.2? If, so, we have a larger problem on our hands, since that means the work-around that caused #1010 to be closed doesn't work.

yes. 0.24.2 on server and clients.

after puppetmaster restart got this kind of report in 4 nodes out of 40. puppet trying to set files right:

Wed Mar 12 09:30:07 +0100 2008 //rubygems/File[rubygems_fact] (notice): Recursively backing up to filebucket
Wed Mar 12 09:30:07 +0100 2008 //rubygems/File[rubygems_fact] (notice): Not removing directory; use 'force' to override
Wed Mar 12 09:30:07 +0100 2008 //rubygems/File[rubygems_fact] (notice): Recursively backing up to filebucket
Wed Mar 12 09:30:07 +0100 2008 //rubygems/File[rubygems_fact] (notice): Not removing directory; use 'force' to override
Wed Mar 12 09:30:07 +0100 2008 //rubygems/File[rubygems_fact] (err): Could not rename tmp /usr/local/lib/ruby/site_ruby/1.8/facter/rubygems_version.rb for replacing: Is a directory - /usr/local/lib/ruby/site_ruby/1.8/facter/rubygems_version.rb.puppettmp or /usr/local/lib/ruby/site_ruby/1.8/facter/rubygems_version.rb

no files were corrupted. only changed to an empty directory.

03/14/08 01:07:52 changed by adamhjk

I tried to re-open #1010 in reference to this bug, but I can't seem to figure out where in the Trac UI one would go about doing that. If #1010 is not fixed, people should be advised to use 0.23.2 until we can validate that corrupted file resources will no longer be utilized on the client.

Are we checksum-ing the corrupted resources?

Either this bug or a re-opened #1010 should have its priority raised to highest, and severity bumped to critical. If we can't re-open #1010, it should be this one.

03/14/08 01:26:22 changed by luke

Anyone want to volunteer to help debug the problem, since no one suffering it has attempted to study it, from what I can see, and I can't reproduce it.

03/14/08 01:42:10 changed by adamhjk

  • attachment memory-month-puppet.png added.

A memory graph showing the puppet master growth, leading up to #1131

03/14/08 01:51:35 changed by shadoi

Here's a curious log snippet from a machine that got corrupted files (client and server both running 0.23.2):

Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[os]/Node[os::redhat]/File[/etc/filter-syslog/redhat]/source (err): Could not describe /os/Redhat/filter-syslog: can't modify frozen object
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[newsyslog]/File[/etc/filter-syslog.conf]/source (err): Could not describe /newsyslog/filter-syslog.conf: wrong status line: "<!DOCTYPE HTML PUBLIC \"-//IETF//DTD HTML 2.0//EN\">"
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[postfix]/File[/etc/filter-syslog/postfix]/source (err): Could not describe /postfix/filter-syslog: wrong status line: "<html><head>"
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[afs]/File[/usr/vice/etc/CellServDB]/source (err): Could not describe /afs/CellServDB: wrong status line: "<title>400 Bad Request</title>"
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[afs]/File[/etc/filter-syslog/afs]/source (err): Could not describe /afs/filter-syslog: wrong status line: "</head><body>"
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[ntp]/File[/etc/ntp/ntpservers]/source (err): Could not describe /ntp/ntpservers: wrong status line: "<h1>Bad Request</h1>"
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[os]/Node[os::redhat]/File[/etc/profile.d/usrlocal.sh]/source (err): Could not describe /os/Redhat/profile.d/usrlocal.sh: wrong status line: "<p>Your browser sent a request that this server could not understand.<br />"
Tue Mar 11 14:29:51 -0700 2008 /Main[top]/Node[basenode]/Node[ntp]/File[/etc/filter-syslog/ntp]/source (err): Could not describe /ntp/filter-syslog: wrong status line: "Reason: You're speaking plain HTTP to an SSL-enabled server port.<br />"
Tue Mar 11 14:29:52 -0700 2008 /Main[top]/Node[basenode]/Node[tmpclean]/File[/etc/cron.daily/tmpwatch]/source (err): Could not describe /tmpclean/tmpwatch: wrong status line: "Instead use the HTTPS scheme to access this URL, please.<br />"
Tue Mar 11 14:29:52 -0700 2008 /Main[top]/Node[basenode]/Node[virt_all_users]/Node[client_gsbweb]/Remotefile[/usr/share/ssl/openssl.cnf]/File[/usr/share/ssl/openssl.cnf]/source (err): Could not describe /dist/clients/gsbweb/usr/share/ssl/openssl.cnf: wrong status line: "<blockquote>Hint: <a href=\"https://henson.stanford.edu:8140/\"><b>https://henson.stanford.edu:8140/</b></a></blockquote></p>"
Tue Mar 11 14:29:53 -0700 2008 /Main[top]/Node[basenode]/Node[gsb-vatuqa]/Node[client_gsbweb_gsb-vatuqa]/Node[client_gsbweb_gsb-vatu-common]/Remotefile[/etc/newsyslog.monthly/tomcat]/File[/etc/newsyslog.monthly/tomcat]/source (err): Could not describe /dist/clients/gsbweb/etc/newsyslog.monthly/tomcat.gsb-vatuqa: wrong status line: "<hr>"
Tue Mar 11 14:29:53 -0700 2008 /Main[top]/Node[basenode]/Node[postfix]/Postmap[/etc/postfix/transport]/Remotefile[/etc/postfix/transport]/File[/etc/postfix/transport]/source (err): Could not describe /postfix/transport: wrong status line: "<address>Apache/2.2.3 (Debian) proxy_html/2.5 mod_ssl/2.2.3 OpenSSL/0.9.8c WebAuth/3.5.3 Server at henson.stanford.edu Port 8140</address>"
Tue Mar 11 14:29:53 -0700 2008 /Main[top]/Node[basenode]/Node[user_root]/Remotefile[/root/.bashrc]/File[/root/.bashrc]/source (err): Could not describe /dist/users/rootuser/bashrc: wrong status line: "</body></html>"
Tue Mar 11 14:29:55 -0700 2008 /Main[top]/Node[basenode]/Node[xinetd]/Service[xinetd] (notice): Triggering 'refresh' from 2 dependencies
Tue Mar 11 14:30:03 -0700 2008 /Main[top]/Node[basenode]/Node[virt_all_users]/Node[client_gsbweb]/Node[apache]/Exec[apache-reload] (notice): Triggering 'refresh' from 1 dependencies
Tue Mar 11 14:30:05 -0700 2008 /Main[top]/Node[basenode]/Node[iptables]/Exec[rebuild_iptables] (notice): Triggering 'refresh' from 4 dependencies

Specifically it looks like things start going south with "can't modify frozen object", maybe a red herring?

03/14/08 19:53:34 changed by descala

  • attachment diff added.

lib/puppet/util/loadedfile.rb

03/16/08 22:15:23 changed by luke

  • milestone changed from 0.24.3 to elmo.

I spent all day Friday and Saturday trying to reproduce this and was not able to.

If you would like to do stand-alone testing, it's very easy using tools I've already written (ext/puppet-test, in the repository. You can specify as many repetitions as you want with --repeat N, and as many forks as you want with --fork (although I've had more luck using separate processes).

It accepts any normal Puppet argument (e.g., server, certname).

To test configuration compiling, use:

sudo puppet-test

To test file retrieving, use:

# Note that you should leave off the "puppet://" here.
sudo puppet-test --retrieve "/path/to/file"

To test file describing, use:

sudo puppet-test --describe "/path/to/file"

You can peruse the file and see what other suites are available. I'll be making a commit today that adds the ability to repeat the server-side tasks, like compiling and parsing.

03/17/08 00:06:23 changed by luke

  • attachment memory_profiler.rb added.

A library you can load to profile memory growth somewhat.

03/17/08 00:10:43 changed by luke

I've added a library that I was using to try to track memory growth. To use it, stick it in your RUBYLIB somewhere, then add this to your puppetmasterd:

require 'memory_profiler'
MemoryProfiler.start(:deltas => false)

Add it around line 281 -- right before start/join stuff done on webrick/mongrel.

This file will create /tmp/memory_log and update a file in there with info on what instances it thinks are being retained across GC runs. Basically, you get a ratio of current instances to the number of those that were also present 10 seconds ago. It's sorted by total instances, so String always shows up first, but you should be able to see anything that grows quickly based on memory leakage. (Assuming the memory leakage is of Ruby objects, not leakage in Ruby C libraries.

03/17/08 00:57:53 changed by Fujin

Thanks for that Luke - I've got your memory profiler loaded on a mongrelized puppetmaster runing inside a screen session with --trace now as per our IRC conversation. Version is 0.24.1. Memory is climbing slowly. When It gets upwards of 250MB~ is when I start seeing issues, so I'll paste up the memory profiler stuff then.

03/17/08 03:16:19 changed by Fujin

14:56 < fujin> lak: 180MB ram usage 14:56 < fujin> Top 20 retained 14:56 < fujin> Array: 362/31875 (0.01) 14:56 < fujin> String: 2/159653 (0.00) 14:56 < lak> any puppet classes showing up in the list? 14:57 < lak> if you want better data, it's probably more useful to sort by the ratio 14:57 < fujin> That's it 14:57 < fujin> (with no active connections) 14:57 < lak> i thought about adding it, but it was like 11pm last night (sat night for me) and i

just couldn't convince myself to

14:57 < fujin> Just those two :0 14:57 < fujin> when a node connects, more things happen 14:57 < lak> really? 14:57 < lak> ah, ok 14:57 < lak> but then it's all reclaimed 14:57 < lak> ok 14:57 < fujin> all of your Puppet magical classes are instantiated, then GC'd correctly 14:57 < lak> so that pretty clearly indicates it's a ruby leak, not a puppet leak

Valgrinding it now, It's obviously an Array of some-sort somewhere in Puppet which is causing this leak - None of the puppet classes are the leaky ones. I'm hoping the valgrind information will be helpful.

03/17/08 03:20:51 changed by Fujin

What a fail.

14:56 < fujin> lak: 180MB ram usage
14:56 < fujin> Top 20 retained
14:56 < fujin> Array: 362/31875 (0.01)
14:56 < fujin> String: 2/159653 (0.00)
14:56 < lak> any puppet classes showing up in the list?
14:57 < lak> if you want better data, it's probably more useful to sort by the ratio
14:57 < fujin> That's it
14:57 < fujin> (with no active connections)
14:57 < lak> i thought about adding it, but it was like 11pm last night (sat night for me) and i
             just couldn't convince myself to
14:57 < fujin> Just those two :0
14:57 < fujin> when a node connects, more things happen
14:57 < lak> really?
14:57 < lak> ah, ok
14:57 < lak> but then it's all reclaimed
14:57 < lak> ok
14:57 < fujin> all of your Puppet magical classes are instantiated, then GC'd correctly
14:57 < lak> so that pretty clearly indicates it's a ruby leak, not a puppet leak

03/17/08 03:28:47 changed by Fujin

Hmph; valgrind makes it so slow that it cannot even compile configurations in time. Is there some way to adjust the timeout for configuration compilation?

03/17/08 03:38:14 changed by Fujin

Blagh, genconfig answered my question. Going to see if I can get it to run with a configtimeout of 300 now.

03/17/08 23:34:40 changed by adamhjk

Okay. If you have an actively flakey puppet master, here are the steps you can follow to attach to it with GDB and do some introspection (in Ruby, no less!)

Attaching to puppetmasterd with GDB

Install GDB

Follow whatever steps make sense for your platform. Something like:

$ yum install gdb

Or

$ apt-get install gdb

Would work just fine.

Populate your .gdbinit and .gdb directories

In ~/.gbinit:

define session-ruby
  source ~/.gdb/ruby
end

Next, you need to download http://eigenclass.org/hiki.rb?c=plugin;plugin=attach_download;p=ruby+live+process+introspection;file_name=ruby

And place it in ~/.gdb/ruby.

$ mkdir -p ~/.gdb
$ wget 'http://eigenclass.org/hiki.rb?c=plugin;plugin=attach_download;p=ruby+live+process+introspection;file_name=ruby'
$ mv ruby ~/.gdb

Attach to the puppetmaster process, load the ruby helpers, and redirect stdout:

$ ps ax | grep puppemasterd
21170 ttyp5    S+     0:02 /usr/bin/ruby /srv/ruby/1.8.6/bin/puppetmasterd --no-daemonize --debug --trace

The important part here is the very first segment of the line (the PID).

$ gdb /usr/bin/ruby 21170

The first argument to gdb should be the full path to the ruby executable that is running your puppetmasterd.

Start your debug session

From above, you should see a bunch of output like this:

...
Loaded symbols for /srv/ruby/1.8.6/lib/ruby/1.8/i686-linux/nkf.so
0xbfffe402 in __kernel_vsyscall ()

And then you'll wind up at...

(gdb)

First things first, load up the ruby helpers.

(gdb) session-ruby

Then, redirect stdout to a file, so we can capture the output for later.

(gdb) redirect_stdout

Open up a second terminal, and you can watch the output from your debug session:

$ tail -f /tmp/ruby-debug.PID

Where PID above is equal to the PID you attached to with gdb.

Now that you are watching, flip back to your gdb session and..

Get some data

(gdb) eval "caller"
(gdb) rb_object_counts
(gdb) eval "total = ObjectSpace.each_object(Array){|x| puts '---'; puts x.inspect }; puts \"---\nTotal Arrays: #{total}\""

You should see a stack trace, a long count of the number of objects in the system. The last line is important, was Luke and Fujin think they are leaking Array's specifically. That will print the value of every array the puppetmasterd has, separated by ---.

It should be a huge amount of data, but it ought to get us pointed in a more specific direction regarding the source of the Array's.

03/18/08 01:18:09 changed by jamtur01

  • owner changed from community to adamhjk.

Adam

Can you put this in the wiki - perhaps in the FAQ?

Thanks

03/18/08 02:02:39 changed by luke

Note this thread on leaks, too:

http://groups.google.com/group/god-rb/browse_thread/thread/1cca2b7c4a581c2

I ran through the whole codebase and fixed all instances of this bug, and pushed it to the memory_leak_split_fix branch in my repository. In my testing, memory grows some, but stops growing at around 50MB and actually drops periodically, which never happened before.

03/18/08 10:49:57 changed by adamhjk

Okay, some more updates.

We discovered a very large (9K+ entries) array that seems to be related to the tagging of Puppet::Parser::Resources. You can see the array by doing something like:

(gdb) eval "by_size = Hash.new; total = ObjectSpace.each_object(Array){|x| by_size[x.object_id] = x }; by_size.sort { |a,b| a[1].length <=> b[1].length }.each { |elem| puts \"OBJ #{elem[1]}: #{elem[0].inspect}\" }; nil"

That will print the object ID: size for every array. I've been doing something like:

(gdb) eval "ObjectSpace.each_object(Array){|x| puts x.inspect if x.length == LENGTH}; nil"

Where LENGTH is the output after the colon from above. That'll show you the actual array contents that have grown so large.

Then:

(gdb) eval "total = ObjectSpace.each_object(Puppet::Parser::Resource){|x| puts '---'; puts x.inspect }; puts \"---\nTotal Parser::Resource: #{total}\""

Will print out all the Puppet::Parser::Resource objects, which I believe is where that array is hiding. Track that guy down, and I think you'll have all you need to figure this out.

Unless your Luke, at which point you might have enough to figure that leak out by poking around the tag stuff got implemented in the parser. :)

More to come.

03/18/08 10:54:13 changed by adamhjk

If you are affected by 1031, and also seeing issues with #1010 (files being replaced with bad contents, or turned magically into directories) please head over to #1010 and read my instructions. We need some more sample data, so we can narrow down whether #1010 and #1031 are actually related, or just friendly.

I promise to make it easy. Your gdb output could be the thing that helps fix 1010! Won't that feel great?

Love, Adam

03/21/08 06:04:55 changed by luke

The array is growing because of line 239 in lib/puppet/parser_support.rb.

I reproduced the problem with puppet-test, the 'local_catalog' suite. I just watched the biggest array and waited for one that got ever-bigger to show up, then froze it and watched for traces. It pointed me right to this line.

Well, technically, I just froze one that got above a certain size; that was pretty easy.

Now to see if I can fix it.

03/21/08 06:11:16 changed by luke

The following diff fixes it, but apparently breaks something else:

diff --git a/lib/puppet/parser/parser_support.rb b/lib/puppet/parser/parser_support.rb
index ccfc4d4..4180806 100644
--- a/lib/puppet/parser/parser_support.rb
+++ b/lib/puppet/parser/parser_support.rb
@@ -235,8 +235,10 @@ class Puppet::Parser::Parser
             end
         end
 
-        unless filename == mod and ! @loaded.include?(mod)
-            @loaded << mod
+        return loaded if loaded or filename == mod
+
+        unless @loaded.include?(filename)
+            @loaded << filename
             # Then the individual file
             begin
                 import(filename)

03/21/08 06:27:58 changed by luke

  • milestone changed from elmo to 0.24.4.

Okay, I fixed the leaking array in [f6325dc].

My changelog says I'm going to assume that this fixes the actual memory leak, although I'm personally a bit skeptical, since this leaking array should get reclaimed every time the files are reparsed, and people don't seem to be seeing memory growing and then dropping.

Should I close this ticket and assume it's fixed until I get other feedback, or what?

Is anyone willing to test it for me?

03/21/08 06:40:59 changed by luke

  • status changed from new to closed.
  • resolution set to fixed.

I've also applied, in commit [18320b8], my code that fixes the split problem discussed in the 'god' memory leak at http://snurl.com/21zf8.

Given that this is two problems fixed, I'm going to close the ticket. If there are further memory growth issues, please let me know.

03/21/08 23:24:45 changed by Fujin

I can't see those commits in the 0.24.x branch - are they in another branch? Will cherry-pick and do testing to see if it is resolved once I work out where they are ;>

03/21/08 23:32:27 changed by luke

They're in [18320b8e3271f7d1d1702907be1ff420acfc8d2b] and [f6325dceb3b10c300f421f540281bbd64bdc091e]. I might have forgotten to push them last night.