Help me find my memory leak


#1

Hello:
After a hiatus due to medical reasons, I’ve decided to go back to an old base of mine(https://github.com/ptrayal/Project_Twilight) and work on debugging it and getting it more stable. The problem I’m stuck on is that there seems to be a memory leak. I’d been using this to try and find it (http://orcs.biz/mud/code/mudbytes.net/replace_buffer.c).

So, the output of using that bit of code is:
Buffer Found:File: Called by this function! (Ignore!)
Buffer Found:File: recycle.c, Function: new_pcdata, Line: 1011
Buffer Found:File: recycle.c, Function: new_pcdata, Line: 1011
3 buffers were found open.

Here is the code in new_pcdata for those who don’t want to look at it from github.
PC_DATA *new_pcdata(void)
{
int alias = 0;

PC_DATA *pcdata;

ALLOC_DATA(pcdata, PC_DATA, 1);

pcdata->confirm_delete		= FALSE;
pcdata->full_reset			= FALSE;

pcdata->acct_login		= NULL;
pcdata->bamfin			= NULL;
pcdata->bamfout			= NULL;
pcdata->block_join		= NULL;
pcdata->ignore_reject	= NULL;
pcdata->pwd				= NULL;
pcdata->rpok_string		= NULL;
pcdata->title			= NULL;

pcdata->points			= 0;
pcdata->security		= 0;
pcdata->true_sex		= 0;

for (alias = 0; alias < MAX_ALIAS; alias++)
{
	pcdata->alias[alias] = NULL;
	pcdata->alias_sub[alias] = NULL;
}

pcdata->buffer = new_buf(); // <<<<-----This line is the problem.

return pcdata;

}


#2

The very first thing you should do is a make clean, then fix the 40 or so compiler warnings that come up. You can fix the unused-but-set-variable and unused-result warnings last, but you absolutely must fix the maybe-uninitialized, pointer-compare, misleading-indentation, int-to-pointer-cast, sizeof-pointer-memaccess, and sizeof-array-argument ones if you want to make traction finding your leak.

Once it builds cleanly, you can run it under valkyrie/valgrind and use that to start hunting down the leak.

Good luck!


#3

Aright, all the compiler errors except for a few of the “unused-but-set-variable” compiler error messages have been fixed. Then I figured I would try running valgrind and see what came up. This seems to be what is coming up.
–13002-- REDIR: 0x4ef5e20 (libc.so.6:__strncat_sse2_unaligned) redirected to 0x4c32b90 (strncat)
–13002-- REDIR: 0x4ef6db0 (libc.so.6:__strstr_sse2_unaligned) redirected to 0x4c376e0 (strstr)
==13002== Invalid write of size 1
==13002== at 0x14AD4D: game_loop_unix (comm.c:729)
==13002== by 0x14B35C: main (comm.c:949)
==13002== Address 0x582ebfd is 1,069 bytes inside a block of size 1,656 free’d
==13002== at 0x4C30D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13002== by 0x137BE3: free_descriptor (recycle.c:285)
==13002== by 0x1432FC: close_socket (comm.c:1317)
==13002== by 0x2059E6: do_quit (act_comm.c:2071)
==13002== by 0x1DB6C0: interpret (interp.c:963)
==13002== by 0x20C86B: substitute_alias (alias.c:64)
==13002== by 0x14ADB1: game_loop_unix (comm.c:718)
==13002== by 0x14B35C: main (comm.c:949)
==13002== Block was alloc’d at
==13002== at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13002== by 0x137A2F: new_descriptor (recycle.c:255)
==13002== by 0x1406F9: init_descriptor (comm.c:1176)
==13002== by 0x14AAD8: game_loop_unix (comm.c:647)
==13002== by 0x14B35C: main (comm.c:949)

Suggestions now on what I should be looking for/at?
Thanks.


#4

The original ROM implements its own memory manager.
For an explanation see…
http://sourcery.dyndns.org/wiki.cgi?RomMemorySystem

I’m not sure the replacement buffer code you are using handles memory correctly.


#5

Can’t really debug your code with what you have here. Valgrind indicates the block you are attempting to access has been freed and from the later code it looks like it is probably being freed properly (new_description() for alloc, then free_descriptor() for unalloc). That suggests that something somewhere is either still referring to the descriptor or to some element in the descriptor struct… or you have a wild pointer somewhere that is indiscriminantly out of bounds. Look at comm.c on line 729 may help shed some light on which issue is at play.


#6

Ok, I dunno about your leak… but I can help you with interpreting that valgrind output. I’ll do it here instead of in a PM because I think it might be helpful for others to see the debugging process…

The first thing Valgrind is telling you that you the statement on line comm.c:729 is trying to write one byte to a piece of memory that valgrind knows has already been free’d. From looking at the source line, it is trying to write a ‘\0’ character to the first byte of the (struct descriptor_data *)d->incom array of characters. Since that array is a statically allocated part of your descriptor_data and not a pointer to another string somewhere, it means that the code has somehow managed to be working on a descriptor_data item that it must have gotten out of descriptor_list, but that’s already supposed to have been removed from the list somehow. How can that happen?

The next thing Valgrind is telling you is what free’d that memory block. It’s giving you the whole call stack of what did the free, and you can see that it was was a close_socket() called by do_quit(). That makes sense, because the write error happens when a character types “quit”. So why is that a problem?

The last thing Valgrind gives you is what originally allocated the memory block that comm.c:729 is trying to write into. From the call trace, the block came from new_descriptor. That implies the memory block was at one time a valid, allocated (struct descriptor_data). Its not like comm.c:729 is completely off in the weeds- its working on something that used to be a descriptor_data, but suddenly isn’t anymore. That is likely because of some order of operations issue. What could it be?

Here’s what it probably is.

game_loop_unix() has a few loops that iterate over the descriptor list. It loops over the list to process new socket descriptors, then it loops to handle any sockets with errors, it loops to process input, and it loops to process output. Those loops are all written to hold the current and next item of the linked descriptor_data in the variables d and d_next. It does it so that if it has to call close_socket(d) and the (struct descriptor_data*) pointed to by d suddenly gets free’d, the loop can still remembers what the value of d->next was supposed to be without having access to d anymore.

However, doing it that way implies that the ONLY thing that can modify the descriptor_list is that loop itself, otherwise its values for what the current and next items in the list are (d and d_next) may not be accurate.

And the smoking gun? do_quit() calling close_socket() from inside the input processing loop. Close socket free’s the descriptor_data, but it doesn’t have any way to let the parent callers know it happened. game_loop_unix() has no way of knowing that calling substitute_alias() had the effect of free’ing the descriptor_data pointed to by d, so game_loop_unix() has no way of knowing that it can’t use the data pointed to by d again like it tries to do in comm.c:729. That’s your invalid write.

There are several ways to solve it… an obvious but complex one is to have your callers provide a return value that can be used to indicate that the descriptor_data is gone. A much easier one might be to add a new value like CON_CLOSING do your d->connected types. Have do_quit() set that connected type instead of calling close_socket() directly, and then have game_loop_unix() check for that change of connected-ness right after input processing, and let it handle the close_socket() instead.

Hope that helps. :slight_smile:


#7

Seem to remember a leak coming up like that in the ROT you worked on with me, turned out it was something to do with substitut_alias i think… I know with discussion with another coder we worked out a fix.


#8

Ok, let me see if I can answer some of the stuff here.

So, part of what happens in my mud shutdown is it runs a command called cleanup_mud, which goes through and per the comment, cleans up various lists and cleaning.
Code from comm.c
// Start Going through the different lists and cleaning.

	log_string(LOG_ERR, "cleaning character list");
	for(ch = char_list; ch != NULL; ch = ch_next)
	{
		ch_next = ch->next;

		extract_char(ch, true);
	}

	log_string(LOG_ERR, "cleaning objects");
	for(obj = object_list; obj != NULL; obj = obj_next)
	{
		obj_next = obj->next;

		extract_obj(obj);
	}

However, it never makes it to cleaning objects. Which means it getting stuck in extract_char. So, I put a bunch of logs in it to see what is going on there. And it makes it “Extract Case 6”, but never more. But I can’t figure out why it is not logging anything else. This also matches what Rahjilll said with it being a problem somewhere in do_quit, which also uses extract_char.
log_string(LOG_COMMAND, “Extract Char 6 -> Successfully after Remove NPC from pIndexData.”);

if ( ch->desc != NULL && ch->desc->original != NULL )
{
	do_function(ch, &do_return, "" );
	log_string(LOG_COMMAND, "Extract Char 6a -> Successfully");
	ch->desc = NULL;
	log_string(LOG_COMMAND, "Extract Char 6b -> Successfully");
}
log_string(LOG_COMMAND, "Extract Char 7 -> Successfully after do_return.");

It always logs 6, but not 6a or 6b. I’m hoping this helps because I’m really confused at this point.


#9

Hummm… what’s it look like when you trace it out with gdb? Also, how are you shutting down the game exactly?

Just an observation, but you do have kind of the same problem going on here as you did above… you are running code that modifies a linked list while you are already iterating over that same linked list in a parent caller. That can be really error prone if you aren’t being careful with it.

One I’m looking at is near comm.c:384, “cleaning character list”. You’ve got a for() loop going over the char_list to call extract_char() on all of ch’s- but extract_char() also loops over and modifies the same char_list. Maybe instead of the for() loop, you could try

while(char_list != NULL) {
extract_char(char_list,true)
}

If you are shutting down by running the shutdown command as an in-game player, you’ve got that problem too- your do_shutdown command is iterating over the descriptor_list to close descriptors, at the same time that the parent caller game_loop_unix is already going over the descriptor_list to process input.


#10

Hi there. I’ll check this when I get to my other computer, but I will definitely try that when I get home. And I am shutting down by running the shutdown command. I guess part of the problem is I’m not very comfortable or familiar with linked list, so most of what you say about that I need some sort of primer/tutorial on.


#11

Ok, so put in the while loop and this is what I get. If I log in (the first time), it goes past step 7 and life is good when I quit. It goes to step 7 and everything seems fine. If I login again and then quit, it stops at Step 6 and the trackbuffer shows that I still am having multiple open buffers (which is supposed to tell me I’m leaking someplace).

Now, it is different if I login and immediately shutdown. Then it continues to stop at step 6 like it did previously.


#12

Have a look at comm.c in cleanup_mud(), the section “cleaning active descriptors”.

You’re iterating over the descriptor_list, freeing every descriptor that you see- but you aren’t removing them from the descriptor list first. Then your next call to log_string() ultimately calls something named wiznet() that tries to loop over the descriptor_list again- after you’ve corrupted it by leaving it full of pointers to free’d descriptors.

Instead of this code in comm.c cleanup_mud:

   log_string(LOG_ERR, "cleaning active descriptors");
    for(d = descriptor_list; d != NULL; d = d_next)
    {
        d_next = d->next;

        free_descriptor(d);
    }

Do this:

log_string(LOG_ERR, "cleaning active descriptors");
while((d=descriptor_list)) {
    descriptor_list = d->next;
    free_descriptor(d);
}

And in do_shutdown, I’d get rid of that call to close_socket(d). Cleanup_mud will take care of that once the program is out of the descriptor input service loop.

…and now for the specific buffer loop you’re asking about, in free_char() just before the line that reads “PURGE_DATA(ch->pcdata);”, try adding the line “if(ch->pcdata) free_pcdata(ch->pcdata);”

Again, hope that helps. It’s been kinda fun looking at someone else’s code for a change. :slight_smile:


#13

And I appreciate the help. Ok, so put in the code you suggested and now I’m getting this when I shutdown at the command line.

project: handler.c:824: get_trust: Assertion `ch' failed.
Aborted (core dumped)

So, I ran the mud in gdb and got this when I shutdown.

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff7a24801 in __GI_abort () at abort.c:79
#2  0x00007ffff7a1439a in __assert_fail_base (fmt=0x7ffff7b9b7d8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x5555556870be "ch", file=file@entry=0x5555556859ec "handler.c", line=line@entry=824, 
    function=function@entry=0x555555686208 <__PRETTY_FUNCTION__.8247> "get_trust") at assert.c:92
#3  0x00007ffff7a14412 in __GI___assert_fail (assertion=assertion@entry=0x5555556870be "ch", file=file@entry=0x5555556859ec "handler.c", 
    line=line@entry=824, function=function@entry=0x555555686208 <__PRETTY_FUNCTION__.8247> "get_trust") at assert.c:101
#4  0x0000555555622aaf in get_trust (ch=<optimized out>) at handler.c:824
#5  0x0000555555633a9a in wiznet (string=string@entry=0x7fffffffbda0 "Command: Extract Char 2 -> Successfully die_follower.", 
    ch=ch@entry=0x0, obj=obj@entry=0x0, flag=flag@entry=4194304, flag_skip=flag_skip@entry=0, min_level=min_level@entry=0) at act_wiz.c:169
#6  0x00005555556177c0 in log_string (type=type@entry=64, fmt=fmt@entry=0x555555686478 "Extract Char 2 -> Successfully die_follower.")
    at db.c:5033
#7  0x0000555555625f60 in extract_char (ch=0x555555baa600, fPull=<optimized out>) at handler.c:2570
#8  0x000055555558b6a9 in cleanup_mud (control=-1) at comm.c:386
#9  0x0000555555597358 in main (argc=2, argv=0x7fffffffe008) at comm.c:963

So, I expanded into them and got this.

(gdb) frame 3
#3  0x00007ffff7a14412 in __GI___assert_fail (assertion=assertion@entry=0x5555556870be "ch", file=file@entry=0x5555556859ec "handler.c", 
    line=line@entry=824, function=function@entry=0x555555686208 <__PRETTY_FUNCTION__.8247> "get_trust") at assert.c:101
101	assert.c: No such file or directory.
(gdb) frame 4
#4  0x0000555555622aaf in get_trust (ch=<optimized out>) at handler.c:824
824		assert(ch);
(gdb) list
819	/*
820	 * Retrieve a character's trusted level for permission checking.
821	 */
822	int get_trust( CHAR_DATA *ch )
823	{
824		assert(ch);
825	
826		if ( ch->desc != NULL && ch->desc->original != NULL )
827		{
828			ch = ch->desc->original;
(gdb) frame 5
#5  0x0000555555633a9a in wiznet (string=string@entry=0x7fffffffbda0 "Command: Extract Char 2 -> Successfully die_follower.", 
    ch=ch@entry=0x0, obj=obj@entry=0x0, flag=flag@entry=4194304, flag_skip=flag_skip@entry=0, min_level=min_level@entry=0) at act_wiz.c:169
169					&&  IS_ADMIN(d->character)
(gdb) list
164		DESCRIPTOR_DATA *d;
165	
166		for ( d = descriptor_list; d != NULL; d = d->next )
167		{
168			if (d->connected == CON_PLAYING
169					&&  IS_ADMIN(d->character)
170			&&  IS_SET(d->character->wiznet,WIZ_ON)
171			&&  (!flag || IS_SET(d->character->wiznet,flag))
172			&&  (!flag_skip || !IS_SET(d->character->wiznet,flag_skip))
173			&&  get_trust(d->character) >= min_level
(gdb)

#14

Ok, a bit more information. So, I commented out the assert command in get_trust and reran the code after compiling. This lead to this crash and subsequent gdb output.

Program received signal SIGSEGV, Segmentation fault.
get_trust (ch=0x0) at handler.c:826
826		if ( ch->desc != NULL && ch->desc->original != NULL )
(gdb) bt
#0  get_trust (ch=0x0) at handler.c:826
#1  0x0000555555633a1f in wiznet (string=string@entry=0x7fffffffbda0 "Command: Extract Char 2 -> Successfully die_follower.", 
    ch=ch@entry=0x0, obj=obj@entry=0x0, flag=flag@entry=4194304, flag_skip=flag_skip@entry=0, min_level=min_level@entry=0) at act_wiz.c:169
#2  0x00005555556177c0 in log_string (type=type@entry=64, fmt=fmt@entry=0x5555556863e8 "Extract Char 2 -> Successfully die_follower.")
    at db.c:5033
#3  0x0000555555625f38 in extract_char (ch=0x555555ba7d30, fPull=<optimized out>) at handler.c:2570
#4  0x000055555558b6a9 in cleanup_mud (control=-1) at comm.c:386
#5  0x0000555555597358 in main (argc=2, argv=0x7fffffffe008) at comm.c:963
(gdb) frame 1
#1  0x0000555555633a1f in wiznet (string=string@entry=0x7fffffffbda0 "Command: Extract Char 2 -> Successfully die_follower.", 
    ch=ch@entry=0x0, obj=obj@entry=0x0, flag=flag@entry=4194304, flag_skip=flag_skip@entry=0, min_level=min_level@entry=0) at act_wiz.c:169
169					&&  IS_ADMIN(d->character)
(gdb) list
164		DESCRIPTOR_DATA *d;
165	
166		for ( d = descriptor_list; d != NULL; d = d->next )
167		{
168			if (d->connected == CON_PLAYING
169					&&  IS_ADMIN(d->character)
170			&&  IS_SET(d->character->wiznet,WIZ_ON)
171			&&  (!flag || IS_SET(d->character->wiznet,flag))
172			&&  (!flag_skip || !IS_SET(d->character->wiznet,flag_skip))
173			&&  get_trust(d->character) >= min_level
(gdb) frame 2
#2  0x00005555556177c0 in log_string (type=type@entry=64, fmt=fmt@entry=0x5555556863e8 "Extract Char 2 -> Successfully die_follower.")
    at db.c:5033
5033		wiznet(bufee,NULL,NULL, WIZ_LOG, 0, 0);
(gdb) list
5028			fprintf( log_file, "%s :: %s\n", buffer, bufew );
5029			fflush(log_file);
5030			fclose( log_file );
5031			snprintf(bufee,sizeof(bufee), "Command: %s",bufew);
5032		}
5033		wiznet(bufee,NULL,NULL, WIZ_LOG, 0, 0);
5034	}
5035	
5036	
5037	MPROG_CODE *get_mprog_index( int vnum )
(gdb)

#15

Try adding this after line 167 of act_wiz.c

if(d->character == NULL) continue;

Your IS_ADMIN() macro ultimately calls get_trust(), which is also not NULL protected. You probably want to have that return an untrusted level if it gets passed a null ch pointer, instead of wetting the bed like it does.

As a general suggestion, you should get rid of every one of those assert() calls that are left in your project. All they really do is make debugging more difficult. Replace them with code that handles the error appropriately and you’ll be better off.

Do a git commit and I’ll have another look at your changes. :slight_smile:


#16

Alright, put in the line in act_wiz.c and commented out the asserts for the time being (I’ll go back and error check them differently soon) and then did another commit so you can check it out.

I do greatly appreciate your help and assistance.


#17

Ok. So I encountered this crash bug when I made a new character (which only happened because I put together another computer).

Program received signal SIGSEGV, Segmentation fault.
number_argument (argument=argument@entry=0x0, arg=arg@entry=0x7fffffffc0b0 "") at interp.c:1173
1173		for ( pdot = argument; *pdot != '\0'; pdot++ )
(gdb) bt
#0  number_argument (argument=argument@entry=0x0, arg=arg@entry=0x7fffffffc0b0 "") at interp.c:1173
#1  0x00005555555d0f11 in get_char_room (ch=ch@entry=0x555555babcb0, argument=argument@entry=0x0) at handler.c:2736
#2  0x00005555555d1681 in get_char_world (ch=ch@entry=0x555555babcb0, argument=0x0) at handler.c:2787
#3  0x00005555555d17cd in can_see_obj_main (ch=ch@entry=0x555555babcb0, obj=obj@entry=0x555555bae280) at handler.c:3410
#4  0x00005555555d1b10 in can_see_obj (ch=ch@entry=0x555555babcb0, obj=obj@entry=0x555555bae280) at handler.c:3470
#5  0x000055555559f2f3 in do_wear (ch=0x555555babcb0, ch@entry=<error reading variable: value has been optimized out>, 
    argument=<optimized out>, argument@entry=<error reading variable: value has been optimized out>) at act_obj.c:2095
#6  0x000055555561b908 in do_function (ch=<optimized out>, do_fun=<optimized out>, argument=<optimized out>) at interp.c:1019
#7  0x0000555555590228 in nanny (d=0x555555ba8ca0, argument=<optimized out>) at comm.c:3154
#8  0x00005555555912b4 in game_loop_unix (control=4) at comm.c:736
#9  0x0000555555591878 in main (argc=2, argv=0x7fffffffdfe8) at comm.c:960
(gdb) list
1168	int number_argument( char *argument, char *arg )
1169	{
1170		char *pdot;
1171		int number = 0;
1172	
1173		for ( pdot = argument; *pdot != '\0'; pdot++ )
1174		{
1175			if ( *pdot == '.' )
1176			{
1177				*pdot = '\0';
(gdb) frame 1
#1  0x00005555555d0f11 in get_char_room (ch=ch@entry=0x555555babcb0, argument=argument@entry=0x0) at handler.c:2736
2736		number = number_argument( argument, arg );
(gdb) list
2731		CHAR_DATA *rch;
2732		char arg[MIL]={'\0'};
2733		int number = 0;
2734		int count = 0;
2735	
2736		number = number_argument( argument, arg );
2737	
2738		if ( !str_cmp( arg, "self" ) || !str_prefix( arg, "ch->name" ) )
2739		{
2740			return ch;
(gdb) frame 3
#3  0x00005555555d17cd in can_see_obj_main (ch=ch@entry=0x555555babcb0, obj=obj@entry=0x555555bae280) at handler.c:3410
3410		if ((owner = get_char_world(ch, obj->owner)) != NULL)
(gdb) list
3405		if ( !IS_SET( ch->act, ACT_UMBRA ) && IS_SET(obj->extra_flags, ITEM_UMBRAL) && !IS_ADMIN(ch))
3406		{
3407			return FALSE;
3408		}
3409	
3410		if ((owner = get_char_world(ch, obj->owner)) != NULL)
3411		{
3412			if ( ch != owner && IS_SET(obj->extra_flags, ITEM_HIDDEN) && !IS_ADMIN(ch))
3413			{
3414				return FALSE;
(gdb)

#18

OK, fixed the crash bug I posted. Still have the trackbuffer leak though. :slight_smile:


#19

Thanks for the help with this. It now properly shuts down and the leak seems to be gone when I quit and log back in seems to good. Thank you (until next time I need help).