
What is a Buffer Overflow and How Hackers Exploit these Flaws part #1

Unmasking the Hidden Threat - Buffer Overflow Vulnerabilities Demystified

10 Nov 2020

Introduction

Buffer overflows pose a critical threat to software security. In this comprehensive guide, we explore the intricacies of these vulnerabilities, how they occur, and the devastating consequences they can unleash. Whether you’re a developer, security enthusiast, or curious reader, understanding buffer overflows is essential in safeguarding digital systems.


Overview

In information security and programming, a buffer overflow is an anomaly where a program, while writing data to a buffer, overruns the buffer's boundary and overwrites adjacent memory locations. Buffers are areas of memory set aside to hold data, often while moving it from one section of a program to another, or between programs. If one assumes all inputs will be smaller than a certain size and the buffer is created to be that size, then an anomalous transaction that produces more data could cause it to write past the end of the buffer.

This article is the first in a series dedicated to binary exploitation, more commonly known as "pwn". You are probably wondering what this means. It usually refers to the art of exploiting vulnerabilities that stem from the misuse of a language such as "C" or "C++". Some people use the term "pwn" for the exploitation of any flaw caused by the misuse of any language, code injection for example, but we will stick to the first definition for now.

One of the best-known vulnerabilities related to the misuse of a programming language is the buffer overflow. Today we will work through the basics of this vulnerability together.

Let's Start at the Beginning

When you develop a program in a language such as "C" or "C++" and compile it with gcc using the command:

gcc -o program program.c

Do you know how gcc transforms your "C" code into machine language that the computer can execute? In short, once the preprocessor has run, the process takes 3 steps:

  1. The "C" code is compiled into assembly language, the lowest-level language above raw machine code.
  2. The assembly code is assembled into binary machine code (an object file).
  3. The object file is "linked": the references to the libraries used by the code are resolved to produce the executable.

Now let's see the basics of assembly, since this language is essential to understanding the exploitation process.

section .text
    global _start

_start:
    push rdx
    mov rdi, 0x4444444444444444 ; v_addr
    mov rsi, 0x5555555555555555 ; len
    mov rdx, 0x7 ; RWX
    mov rax, 10 ; mprotect syscall number
    syscall
    mov rcx, 0x2222222222222222
    mov rsi, 0x3333333333333333
    mov rdx, 0x6666666666666666 ; random_int
    mov rdi, rsi
    jmp _loop

_loop:
    cmp rcx, 0x0
    je _end
    lodsb
    not al
    xor al, dl
    stosb
    loop _loop

_end:
    pop rdx
    mov rax, 0x1111111111111111
    jmp rax

There is nothing to worry about if you don't understand! The purpose of the above piece of code was just to show you what assembly looks like. As you can see, the code is made of instructions such as push, mov, cmp, and many more.

; Note: in assembly, everything after a semicolon is a comment
call 0x80483dc; Call function at address 0x80483dc
push 0x0; puts the value 0x0 on the stack
pop ebx; put what is at the top of the stack in ebx
mov eax, 0x1; puts 0x1 in eax

As you can see, it is not complicated. Now, what are eax and ebx? They are what we call registers: small storage locations inside the processor that hold values such as an address or a number. On a 32-bit processor, the registers eax, ebx, ecx, edx, ebp, esp, eip, and edi are 32 bits (4 bytes) wide. There are other registers, but they are of no interest to us for the moment. One more thing: on a 64-bit processor we use the same registers, except that they are 64 bits (8 bytes) wide instead of 32 and the "e" is replaced by "r", so the register names become rax, rbx, rcx, rdx, rbp, rsp, rip, and rdi.

Let's look at what each of these registers is used for:

  • eax -> Accumulator Register ~ used for arithmetic operations and storing the return value of system calls
  • ebx -> Base Register ~ used as a data pointer located in DS in segmented mode
  • ecx -> Counter Register ~ used as a counter for loops and repeated instructions
  • edx -> Data Register ~ used for arithmetic operations and input/output operations
  • ebp -> Extended Base Pointer ~ used to define the base pointer for the current stack frame
  • esp -> Stack Pointer ~ used to define the current stack pointer
  • eip -> Instruction Pointer ~ tells the processor the address of the next instruction to execute

The Memory Segments

Since we are focusing on the exploitation of Linux systems, we are interested in the different segments of an ELF (Executable and Linkable Format) binary, which is the executable file format under Linux.

I'm sure the first question crossing your mind is "Nice, but what does all of this mean?". Well, it's simple: these are just our segments, so let's see what each one contains!

  • .data - Contains initialized global and static variables.
  • .bss - Contains uninitialized (zero-initialized) global and static variables.
  • .text - Contains our code.

It may not be clear enough, so let's take a little piece of code to illustrate this:

#include <stdio.h>
#include <stdlib.h>

int a;
static int b;

int main(int argc, char **argv) {
    int c = 0;
    printf("Hello ! \n");
    return 0;
}

In the above example we can see the following:

  • a and b are in .bss.
  • c is put on the stack.
  • Our function main is in .text.

Important: the #include and #define directives are handled by what we call the preprocessor. They are resolved before compilation, so they no longer appear in the compiled program. Let's take a closer look, which will also show us how to disassemble a program:

#define NUMBER 15

int main() {
    int a = NUMBER;
}

Now we will run a test by compiling the small piece of code above. First of all, open your terminal and run the following commands one by one:

mkdir ~/Backtest
cd ~/Backtest
cat > ~/Backtest/code.c << EOL
#define NUMBER 15

int main() {
    int a = NUMBER;
}
EOL

gcc -o code code.c
file code

Now that we have compiled our file, we will disassemble it using objdump, a linear disassembler available in most modern GNU/Linux distributions.

objdump -M intel -d code

Don't get scared, all that you see is correct! Let's focus on the "0000000000001119 <main>" section.

0000000000001119 <main>:
    1119:    55                       push   rbp
    111a:    48 89 e5                 mov    rbp,rsp
    111d:    c7 45 fc 0f 00 00 00     mov    DWORD PTR [rbp-0x4],0xf
    1124:    b8 00 00 00 00           mov    eax,0x0
    1129:    5d                       pop    rbp
    112a:    c3                       ret    
    112b:    0f 1f 44 00 00           nop    DWORD PTR [rax+rax*1+0x0]

Do you remember what we saw above about rbp? If you have already forgotten, I kindly suggest you go back to the top! Let's continue and check together exactly what our program does.

  • push rbp saves rbp on the stack
  • mov rbp, rsp copies rsp into rbp
  • mov DWORD PTR [rbp-0x4], 0xf stores 0xf (hexadecimal for 15) at address rbp-0x4, the stack slot of our variable a
  • mov eax, 0x0 puts 0 in eax (the return value of main)
  • pop rbp restores rbp from the top of the stack
  • ret returns from main

As you can see, there is no trace of the #define. There is no variable name either, but ultimately 15 is stored on the stack at [rbp-0x4], and that is exactly what we wanted, isn't it?

Let's make this clearer, just to be sure you got it. Take this little piece of code:

cat > ~/Backtest/piece.c << EOL
#include <stdio.h>

int main(void) {
    return 0;
}
EOL

gcc -o piece piece.c
size piece

Now let's see what happens if we modify our code a little by adding a new variable.

cat > ~/Backtest/piece.c << EOL
#include <stdio.h>

int globalvariable;

int main(void) {
    return 0;
}
EOL

gcc -o piece piece.c
size piece

As you can see, bss went from 8 bytes to 12 bytes (the exact figures may differ on your system). Since its size increased by the 4 bytes of an int, we can conclude that our uninitialized global variable is stored in bss.

Now we will run another test by adding an initialized static variable to our main function and see what happens.

cat > ~/Backtest/piece.c << EOL
#include <stdio.h>

int main(void) {
    static int staticvariable = 1;
    return 0;
}
EOL

gcc -o piece piece.c
size piece

As you can see, the variable was not stored in bss like in the previous example but in data, which went from 512 bytes to 516 bytes. That is because it has an initial value; a static variable without one would go in bss, just like our global did. Interesting, no?

We will make a final test to see what happens if we initialize a global variable.

cat > ~/Backtest/piece.c << EOL
#include <stdio.h>

int global = 1;

int main(void) {
    return 0;
}
EOL

gcc -o piece piece.c
size piece

We got the same result, and do you know why? A global variable, if it is initialized, is put in data. I think we have covered this point enough for you to understand where and how variables are stored.

How Does Memory Work?

Here we will not talk about the memory hardware itself, but about how RAM is managed by your operating system. A process that runs on a machine requires memory, and the amount of memory in a computer is limited.

Processes must therefore look for available memory in order to work. Say several processes are running at the same time. What would happen if two processes wanted to use the same memory area? If one process wrote to a memory area and another process then overwrote that area with its own data, the first process would expect to find its data but would find the second process's data instead. That could be a very big problem, no?

This is where a major feature of your operating system provides a solution: it gives each process its own virtual address space. That space is limited to 4 GB on 32-bit systems (2^32 addresses); on 64-bit systems it is vastly larger, 2^64 bytes in theory, although current x86-64 processors typically use 48-bit virtual addresses (256 TiB).

Each process can use the memory addresses it needs without worrying about other processes; the kernel of the operating system takes care of mapping virtual memory to real memory.

The Stack And The Heap

We will now move on to something very important. The heap can be manipulated by the programmer: it is the part of memory that holds the areas allocated dynamically with malloc() or calloc().

This memory area does not have a fixed size. It grows and shrinks on demand: we reserve blocks and release them for future use through allocation and release routines. The more the heap grows, the higher its memory addresses become, and the closer they get to the addresses of the stack. Unlike the stack, the size of the data in the heap is limited only by the available memory.

Variables stored in the heap are accessible from anywhere in the program through pointers. The stack also has a variable size, but as it grows its memory addresses decrease, approaching the top of the heap: the stack grows down. This is where we find the local variables of functions.

The stack frame of a function is a memory area on the stack that stores all the information needed to call this function and return from it. The function's local variables live there too.

Going Deeper into the Stack

Let's start with LIFO, which is nothing complicated since we already touched on it earlier. LIFO stands for Last In, First Out: the last thing put on the stack is the first one taken off, as we saw with push and pop.

  • We do a push rax; the value of rax is now on the stack
  • We do a pop rbp; the value of rax is now in rbp and is removed from the stack

Actually, that is not quite accurate. The value stays at the same address in memory; pop simply moves rsp so that it now points to a higher address.


Finally The Buffer Overflow

Finally, we come to the exploitation part. Let's take a small piece of code like the one below.

#include <stdio.h>

int main() {
    char buffer[100];
    int a;
    scanf("%s", buffer); /* no width limit: input longer than 99 characters overflows buffer */
    return 0;
}

The code looks completely normal; you have surely used the scanf function while learning "C". But let's look at what the stack looks like in this example.

[buffer (100)] [int a] [saved ebp] [saved eip]

You may be wondering what saved ebp and saved eip are. We don't care about "saved ebp"; what interests us is "saved eip", the return address. Do you remember what eip contains? The address of the next instruction to execute. If we change this saved address, we can make the program run anything once the function returns!

But how do you change this value? It's very simple: with a bare %s, scanf does not check the number of characters it receives! Let's take a piece of code to make this clearer.

#include <stdio.h>

void adminfunction() {
    printf("You successfully jumped on 'adminfunction' ! \n");
}

void vuln() {
    char buffer[10];
    scanf("%s", buffer); /* deliberately unbounded: this is the vulnerable call */
}

int main() {
    vuln();
    return 0;
}

If we give scanf too large an input while it runs, for example a long run of "A" characters, this causes a buffer overflow. The saved eip is overwritten with "A" bytes (0x41414141), which is not a valid address, so the program crashes with a segmentation fault. But if we replace those "A" bytes with a valid address, we can jump anywhere, in particular to adminfunction(). That's it for the theory. We will see the practice in the next article.


Conclusion

In conclusion, buffer overflows remain a persistent challenge for developers worldwide. By adopting secure coding practices, vigilant testing, and continuous education, we can mitigate these risks and fortify our applications against malicious exploits.

Created by Nicolas C.
