Why can't code be uncompiled?

Squizzy@lemmy.world · 1 year ago

Why can't code be uncompiled?

Feyr@lemmy.world · 1 year ago

4+4 is 8. But so is 6+2. And 7+1.

You can’t guess which two numbers I started with knowing just the answer

Code is the same, just with much bigger numbers and more of them
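A minimal Python sketch of that idea, limited to non-negative whole numbers for simplicity (the function name `pairs_that_sum_to` is made up for illustration):

```python
def pairs_that_sum_to(total: int) -> list[tuple[int, int]]:
    """Return every pair of non-negative integers (a, b) with a + b == total."""
    return [(a, total - a) for a in range(total + 1)]


# Knowing only the answer (8), every one of these pairs is a valid "original".
candidates = pairs_that_sum_to(8)
print(len(candidates))  # 9
print((4, 4) in candidates, (6, 2) in candidates, (7, 1) in candidates)
```

Even this tiny case has nine equally valid starting points, and nothing in the answer says which one was used.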

Treczoks@lemmy.world · 1 year ago

Very nice explanation of a complex problem.

MTK@lemmy.world · edit-2 1 year ago

I would say that it’s more like 4+4=8 but the original could have been (1+1+1+1)+(3+1) or (2+2)+(1+2+1) etc.

Basically it’s the same thing but if you really want to understand the code and modify it in any meaningful way you have to know how it was intended and not just the results.

My point being that decompiling does give you something similar to the original. It’s not just a guess that gives you random code with the correct result, but it could be very different from the source code.

The reason is that the compiler does a lot of things to make it more efficient but that just means that while 1+1+1+1 can be efficiently written as 4, there still is a good reason for 1+1+1+1 from a logical sense. For example, if you’re counting something, it would make sense to say 1+1+1+1. But if you’re looking at a specific value, maybe it makes more sense to just say 4.
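To make that concrete, here is a toy sketch of the kind of simplification a compiler performs, usually called constant folding. It is not how any real compiler is implemented; `fold_constants` is a hypothetical helper that only understands integer additions.

```python
import ast


def fold_constants(expression: str) -> int:
    """Toy 'compiler' pass: collapse an expression of integer additions into one value."""

    def evaluate(node: ast.AST) -> int:
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            return evaluate(node.left) + evaluate(node.right)
        raise ValueError("this sketch only handles integer additions")

    return evaluate(ast.parse(expression, mode="eval").body)


# Three different ways of writing the source all fold to the same result.
counting = fold_constants("(1+1+1+1)+(3+1)")
grouping = fold_constants("(2+2)+(1+2+1)")
plain = fold_constants("4+4")
print(counting, grouping, plain)  # 8 8 8
```

Once the folded value is all that is left, the grouping that carried the programmer's reasoning is gone.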

Trigg@lemmy.world · 1 year ago

It can be

What it produces will typically not contain the original names for variables and functions, and will not retain comments. It takes a lot more effort to understand what the intention behind the code was.

There are also legal issues.

fenynro@lemmy.world · edit-2 1 year ago

The long answer involves a lot of technical jargon, but the short answer is that the compilation process turns high level source code into something that the machine can read, and that process usually drops a lot of unneeded data and does some low-level optimization to make things more efficient during actual processing.

One can use a decompiler to take that machine code and attempt to turn it back into something human readable, but the result will usually be missing data on variable names, function calls, comments, etc. and will include compiler-added optimizations, which makes it nearly impossible to reconstruct the original code.

It’s sort of the code equivalent of putting a sentence into Google translate and then immediately translating it back to the original. You often end up with differences in word choice that give you a good general idea of intent, but it’s impossible to know exactly which words were in the original sentence.

Squizzy@lemmy.world · 1 year ago

Thank you, sorry to push further but my understanding is that computers deal with binary so every language is compiled to machine code, which I took as binary.

So if the language has elements being removed and the machine doesn’t need them shouldn’t you get back out exactly what is needed to do the task? Like if you compiled some code and then uncompiled it you would get the most efficient version of it because the computer took what it needed, discarded the rest and gave it back to you?

fenynro@lemmy.world · edit-2 1 year ago

It depends on the specifics of how the language is compiled. I’ll use C# as an example since that’s what I’m currently working with, but the process is different between all of them.

C#, when compiled, is actually translated into what is known as an intermediate language (MSIL for C# specifically). This intermediate file is basically a set of genericized instructions that are not linked to any specific CPU. This is useful because different CPUs require different instructions.

Then, when the program is run, a second compiler known as the JIT (just-in-time) compiler takes the intermediate commands and translates them into something directly relevant to the CPU being used.

When we decompile a C# dll, we’re really converting from the intermediate language (generic CPU-agnostic instructions) and translating it back into source code.
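A rough analogy in Python rather than C#, since Python also compiles source to CPU-agnostic bytecode. Treat it as an illustration of the intermediate-language idea only: CPython normally interprets this bytecode instead of JIT-compiling it, and `add_tax` is a made-up function.

```python
import dis


def add_tax(price, rate):
    total = price + price * rate
    return total


# The compiled, CPU-agnostic bytecode is stored as raw bytes on the function.
raw_bytecode = add_tax.__code__.co_code
print(type(raw_bytecode), len(raw_bytecode) > 0)

# dis turns those bytes back into readable instruction names.
# The exact listing varies between Python versions, so none is shown here.
opnames = [instruction.opname for instruction in dis.get_instructions(add_tax)]
print(opnames)
```

Reading the instruction names is a (very small) taste of what a decompiler starts from.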

To your second point, you are correct that the decompiled version will be more efficient from a processing perspective, but that efficiency comes at the direct cost of being able to easily understand what is happening at a human level. :)

Squizzy@lemmy.world · 1 year ago

Could I trouble you to go deeper? I think I’m getting it, but if we were to, say, uncompile GTA V or Super Mario Bros, could we make changes and figure it out from there, or would it be complete nonsense with no waypoints to jump in at and get a grip on what is being done?

On a side note I was told once that everything is 1s and 0s and as a result that someone could type a picture of you if they got the order right. This could be why I’m so wrong in my understanding given I’m now assuming this was bullshit.

mindlessLump@lemmy.world · 1 year ago

Here is a real world example of someone doing some reverse engineering of compiled code. Might help you understand what is possible, and some of the processes. https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times-by-70/

folkrav@lemmy.ca · 1 year ago

At a very low level, yes, everything is 1s and 0s. However, virtually nobody deals with binary anymore. Programming languages are abstractions over abstractions over abstractions not to have to deal with typing binary.
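As a small sketch of the "type a picture" idea, here is a hand-typed string of bits rendered as a tiny image. The 5x5 layout and the helper name `bits_to_picture` are assumptions made up for this example; real image formats add headers, colour and compression on top.

```python
def bits_to_picture(bits: str, width: int) -> str:
    """Render a string of '0'/'1' characters as rows of blank and filled cells."""
    rows = [bits[i:i + width] for i in range(0, len(bits), width)]
    return "\n".join(row.replace("0", ".").replace("1", "#") for row in rows)


# A 5x5 image typed by hand as 25 bits.
typed_bits = "01010" "01010" "00000" "10001" "01110"
print(bits_to_picture(typed_bits, width=5))

# The same idea for text: each character is stored as a number, i.e. a bit pattern.
print(format(ord("A"), "08b"))  # 01000001
```

So typing a picture is possible in principle; it is just wildly impractical for anything bigger than a smiley.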

The point of programming languages is for humans to be able to read it and make sense out of it. It’s a way to represent in a kind of intermediate language that’s halfway between something humans can read and computers can interpret.

Say the game’s programmer wants to handle moving your character right on pressing the right arrow key. They might write some function called “handleRightArrow()”, which does whatever. Then your compiler will turn this into some instructions - read stuff in RAM at address XYZ, copy it over, etc. The original code with readable names, comments, documentation, proper organization, it’s gone. Once you decompile, it’s gonna be random function/variable names, and the compiler might have rewritten some parts of the implementation as automatic optimizations, inlined some functions, etc. The human readable meaning of the code is lost. It does the same thing as the original code, but it isn’t the original code either.

litchralee@sh.itjust.works · edit-2 1 year ago

The implicit assumption with decompiling code is that the goal is either to inspect how the code works, or to try compiling for a different machine. I’ll try to explain why the latter is quite difficult.

As you said, compilation to machine code only keeps the details needed for the CPU to accomplish what was instructed. And indeed, that is supposed to be efficient to run on that CPU, by reason of being targeted exactly for that CPU. But when decompiling, the resulting code will reflect the specificity to that same CPU. If you then try to compile that code for a different CPU, it will likely work, but will likely be inefficient because the second CPU’s unique advantages won’t be leveraged.

To use an example, consider how someone might divide two large numbers. Person A learned long division in school, and so takes each number and breaks it down into a series of smaller multiplications and subtractions. Person B learned to do division using a calculator, which just involves entering the two numbers and requesting that they be divided.

Trying to do division by blindly giving Person B that series of multiplications and subtractions to do on the calculator is extremely inefficient because Person B knows how to do division easily. But Person B is following Person A’s methods, without knowing that the whole point of this exercise is to just divide the two original numbers. Compilation loses context and intent, which cannot be recovered from decompilation, for non-trivial programs.
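A simplified sketch of the two people in Python. Person A's long division is reduced here to plain repeated subtraction to keep it short, which is an assumption of this example rather than the full schoolbook method.

```python
def divide_step_by_step(dividend: int, divisor: int) -> int:
    """Person A's method: reach the quotient through many small subtractions."""
    if divisor <= 0 or dividend < 0:
        raise ValueError("this sketch assumes dividend >= 0 and divisor > 0")
    quotient = 0
    while dividend >= divisor:
        dividend -= divisor
        quotient += 1
    return quotient


def divide_directly(dividend: int, divisor: int) -> int:
    """Person B's method: one built-in operation."""
    return dividend // divisor


print(divide_step_by_step(1000, 7), divide_directly(1000, 7))  # 142 142
```

Both give the same answer, but someone handed only the first function has to work out for themselves that it is "just division" before they can swap in the second.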

Here is an example why source code is useful when it provides context: https://en.m.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code . Very few people would be able to figure out how this works from just the machine code.
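For a feel of it, here is a Python transliteration of the trick described on that page. It is a sketch, not the original C: Python does the final arithmetic in double precision, so results differ slightly from the 32-bit version.

```python
import math
import struct


def fast_inverse_sqrt(number: float) -> float:
    """Approximate 1/sqrt(number) for positive inputs using the bit-level trick."""
    # Reinterpret the 32-bit float's bits as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", number))[0]
    # The "magic" step: nothing in this line says "square root" to a reader.
    bits = 0x5F3759DF - (bits >> 1)
    # Reinterpret the integer's bits as a float again.
    estimate = struct.unpack("<f", struct.pack("<I", bits))[0]
    # One Newton-Raphson iteration to refine the estimate.
    return estimate * (1.5 - 0.5 * number * estimate * estimate)


approximation = fast_inverse_sqrt(4.0)
exact = 1.0 / math.sqrt(4.0)
print(approximation, exact)
```

Even with names and comments this is hard to follow. Strip those away, as compilation does, and the intent is close to invisible.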

UnRelatedBurner@sh.itjust.works · edit-2 1 year ago

follow up, would it be easier to read this context-less source code or stay at assembly? If for example you’d like to modify a closed source app

fenynro@lemmy.world · 1 year ago

Probably depends on how comfortable you are at reading assembly instructions for your specific CPU, but I think generally the contextless source code is probably preferable. Either way you’ve got a headache of an investigation in front of you though.

here’s an example of what it might look like with either option

UnRelatedBurner@sh.itjust.works · 1 year ago

oh wow, I now respect pirates even more. No wonder there are only like 3 guys that can and will do this.

If you decompile you need such an understanding of the language. I could see someone looking at this and going “oh yeah that compares cases”, but then die of old age before finishing the sentence.

And if you don’t decompile you are coding assembly.

litchralee@sh.itjust.works · edit-2 1 year ago

Like many things, it’s very fact-intensive, varying in different circumstances. As others have noted, the abilities of the person undertaking the decompilation will influence the decision. But so will strategy: the overall goal can drive how decompilation is approached.

For example, suppose you’re working for an airline company and need to rewrite some software that runs on an ancient IBM System/360 machine and was written in COBOL, for which no source code is available and you cannot find many people who even know COBOL. Here, since the task is to rewrite the code, decompilation is just to tell you how it works and then you’ll want to write the new program in a modern language. It may be useful to decompile to a different language if such a decompiler is available, say to the C language, which you better understand.

Sure, it may be that C isn’t what the new program will be written in, but if your C reading skills are sufficient, then this is a valid strategy.

The skill of a decompiling engineer – or any engineer really – is leveraging your skills and your tools to tractably attack the difficult problem at hand. Many equally-skilled engineers can plausibly approach the same problem differently.

jayrhacker@kbin.social · 1 year ago

if you compiled some code and then uncompiled it you would get the most efficient version of it … ?

Sorta, an optimizing compiler will always trim dead code which isn’t needed, but it will also do things that are more efficient but make the code harder to understand like unrolling loops. e.g. you might have some code that says “for numbers 1-100 call some function” the compiler can look at this and say “let’s just go ahead and insert 100 calls to that function with the specific number” so instead of a small loop you’ll see a big block of function calls almost the same.
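A hand-written sketch of what unrolling looks like, shortened to the numbers 1-4 instead of 1-100. Python itself does not unroll loops like this; the second half is typed out by hand to mimic what an optimizing compiler might emit.

```python
calls = []


def some_function(number: int) -> None:
    calls.append(number)


# What the programmer wrote: a small loop.
for number in range(1, 5):
    some_function(number)
looped = list(calls)

# What an unrolling compiler might emit instead: the same calls, spelled out.
calls.clear()
some_function(1)
some_function(2)
some_function(3)
some_function(4)
unrolled = list(calls)

print(looped == unrolled)  # True
```

A decompiler sees only the second form, and has to guess whether the programmer wrote a loop or really did type out every call.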

Other optimizations will similarly obfuscate the original programmer’s intent, and things like assertions are meant to be optimized out in production code so those won’t appear in the de-compiled version of the sources.
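Python happens to show the assertion case directly, because the built-in `compile()` takes an `optimize` argument. A small sketch, where `withdraw` and `build` are made-up names for illustration:

```python
source = """
def withdraw(balance, amount):
    assert amount <= balance, "insufficient funds"
    return balance - amount
"""


def build(optimize: int):
    """Compile `source` at the given optimization level and return its withdraw function."""
    namespace = {}
    exec(compile(source, "<example>", "exec", optimize=optimize), namespace)
    return namespace["withdraw"]


debug_withdraw = build(optimize=0)    # assert statements kept
release_withdraw = build(optimize=1)  # assert statements stripped, like `python -O`

print(release_withdraw(10, 25))  # -15: the check no longer exists in the compiled code
try:
    debug_withdraw(10, 25)
except AssertionError as error:
    print("debug build refused:", error)
```

Anyone decompiling the optimized version would never learn that the author intended `amount <= balance` to hold.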

fidodo@lemmy.world · 1 year ago

You can. It’s called decompiling. Problem is you lose all the human friendly metadata that was in the original source code: comments, variable names, and certain code structures are lost forever because they were deleted in the compilation process. There are tools to help you reintroduce that stuff by going through the variables and trying to make sense out of what they were for, but it’s super tedious. New AI tech can certainly improve on that by guessing what they were for, but you’ll never get the original metadata back.

howrar@lemmy.ca · 1 year ago

The best and simplest explanation I’ve seen: The machine code tells the computer what to do while the source code tells the human why it’s doing it.

Your computer doesn’t need all the “why” information to run the game, so the compilation process gets rid of it. What you’re left with are instructions on exactly what computations to do, and that’s all the computer needs.

For example, you can see in the machine code that two numbers are being added together. What do those numbers mean and why are we adding them? The source code can tell you that this is code that controls movement, one of the numbers is a velocity, the other is the player’s current position.

Squizzy@lemmy.world · 1 year ago

Okay, I think that is sinking in.

I was under the impression it wasn’t possible or just complete gibberish but it being just the results or instructions is helpful.

Ook the Librarian@lemmy.world · edit-2 1 year ago

Also, you only decompile to the level of basic instructions that the processor understands. When you compile code to add two numbers, well, the processor only adds bytes. There are quite a few steps that the compiler has to fill in.

Ok, all that is not a big deal. But then you deal with compiler optimization. Optimizing basically tells the compiler to take its time and find some clever ways to save machine steps. So now the “standard way” for a compiler to implement adding numbers may have other stuff rolled into it because the compiler may see an opportunity to save steps in a seemingly unrelated calculation by inserting steps into the addition it is implementing. Now it’s basically unrecognizable. A human didn’t write, and wouldn’t have written, that mess that the decompiler gives.

Edit: I would also like to add that when you compile with the debug flag, you are telling the compiler to produce decompilable code: don’t change any steps, and store variable names as written.

Rikudou_Sage@lemmings.world · 1 year ago

As I’ve read somewhere once: it’s easy to make a burger out of a cow. Making a cow out of a burger is slightly harder.

That means that compiling code is a lossy process - the original code is lost in the process and can never be recovered because it doesn’t exist anywhere anymore.

Donebrach@lemmy.world · 1 year ago

This is the fundamental notion of nearly 95% of cyberpunk stories re: the human soul and yet everyone always is like “but I want my cool robot hand!”

Rikudou_Sage@lemmings.world · 1 year ago

Fuck soul, I want my cool robot hand!

Moondance@sh.itjust.works · 1 year ago

The compilation process discards information, which makes it a many-to-one mapping. A good decompiler allows one to retrieve a program that is functionally equivalent to the source code but not exactly the source code.

ryathal@sh.itjust.works · 1 year ago

Code can be decompiled, but generally the end result isn’t human readable. Just having the decompiled version isn’t that valuable. Having the source code as written is more helpful because you get the context of what things were named and how it was organized.

Decompiled code is a bit like reading a book with all the nouns being random letters and verbs being random numbers.

TheVillageGuy@kbin.social · 1 year ago

Not completely random, every noun/verb would be translatable to a specific word/name. But also characters, there’d be many characters whose names, intentions and goals, relationships/links would also be in the same unreadable state. The storyline would likely not be chronological, but several actions and decisions by all kinds of actors would intertwine. It would be very hard to translate into a readable story, let alone so that it makes sense

the_q@lemmy.world · 1 year ago

The same reason you can’t unbake a cake I’d imagine.

Squizzy@lemmy.world · 1 year ago

Apparently not

NoIWontPickaName@kbin.social · 1 year ago

Has the cake been enclosed in an airtight container since it was done baking?

the_q@lemmy.world · 1 year ago

I dunno.

NoIWontPickaName@kbin.social · 1 year ago

Potentially, yes, if the answer is yes

GlitzyArmrest@lemmy.world · 1 year ago

You can get close depending on the language by using decompilers. Usually though, they’re rough translations of what the decompiler thinks that the (compiled) machine code does. It’s not a 1:1 deal.

Basically, a compiler translates the human-readable code to machine code that can actually be recognized and executed by your computer. A decompiler attempts to do the opposite, it translates the machine code back into the original language. But like some “translators”, it’s not always correct. That’s the hard part - once decompiled you will likely have a lot of blanks to fill in and bugs to fix before anything will be compilable again. You’ll likely never be able to get an exact copy of the original source code via decompiler.

amio@kbin.social · 1 year ago

The general difference is that you lose out on metadata - names, comments and organization that helps the source code in whatever programming language make sense, but which is not needed to actually execute the desired behavior on your CPU. Usually stuff like sensible names for bits of your code - functions/reusable logic, storage locations for “health” or “armor” or “current powerup”, movement states, types of objects etc.

However, most of these are just another kind of number to the computer itself, so a lot of compilation processes strip a lot of this information. You could still reverse engineer it, but you’re missing context (like all those names) from the original code and that makes the work potentially pretty difficult. Bear in mind that reading actual original source code is sometimes cryptic enough, then compare “if player is dead, show game over screen” to if (sdfdfgsdfg == jgdfg) { lkghku(); } because the “decompiler” has to invent some kind of name for everything that’s missing. Now you have to deal with thousands of jfdsghklgs, and figure out what it all means.
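To simulate that name loss, here is a sketch that uses Python's `ast` module to throw away every name in a small function and print what is left. It is not a real decompiler, `is_game_over` and `StripNames` are invented for this example, and the renamer is naive: it would also rename built-ins, which this snippet avoids using.

```python
import ast

source = """
def is_game_over(player_health, extra_lives):
    remaining_chances = extra_lives + (1 if player_health > 0 else 0)
    return remaining_chances == 0
"""


class StripNames(ast.NodeTransformer):
    """Replace every function, argument, and variable name with a meaningless one."""

    def __init__(self):
        self.renamed = {}

    def _alias(self, name: str) -> str:
        # Reuse the alias if this name was seen before, otherwise invent the next one.
        return self.renamed.setdefault(name, f"v{len(self.renamed)}")

    def visit_FunctionDef(self, node):
        node.name = self._alias(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


stripped_tree = StripNames().visit(ast.parse(source))
stripped_source = ast.unparse(stripped_tree)  # requires Python 3.9+
print(stripped_source)
```

The printed function still computes exactly the same thing, but working out that it means "is the game over?" is now the reader's job.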

RightHandOfIkaros@lemmy.world · edit-2 1 year ago

Code can be decompiled into code that can be recompiled, but compilers translate the human readable code into code that is easier to understand for machines. So decompiled code often ends up being nearly undecipherable for humans, and can take a long time to try and decipher.

schnurrito@discuss.tchncs.de · 1 year ago

Others have explained that decompiling is a thing.

I mainly work in Java where (due to the way Java bytecode works) decompiled code is actually very close to the original source code.

Most games are written in low level languages like C++ where that is not the case, variable and function names are lost during compilation.
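Python is used below as a stand-in for Java, since its bytecode behaves similarly in this respect: the compiled code object still carries the names. `move_player` is a made-up function for illustration.

```python
def move_player(position, velocity):
    new_position = position + velocity
    return new_position


# The compiled code object keeps the function name and local variable names.
code = move_player.__code__
print(code.co_name)      # move_player
print(code.co_varnames)  # ('position', 'velocity', 'new_position')
```

When the compiled form keeps this much, a decompiler has far less guessing to do, which is a big part of why bytecode languages decompile so cleanly.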

Moondance@sh.itjust.works · 1 year ago

But generally speaking it’s not a reversible process and it is more difficult to do in reverse.