Issue
I was asked to refactor the following piece of code:
/** * This method receives a list of {@link Employee} objects and will: * * 1- Filter out non-active {@link Employee} objects * * 2- Filter out {@link Employee} objects with invalid codes * * 3- Add up all the salaries of the remaining {@link Employee} objects and return it * * @param employeeList a list of {@link Employee} objects * * @return a {@link BigDecimal} that represents the total salaries of the {@link Employee} objects * * that are not expired and have valid codes */ public BigDecimal getActiveAndValidEmployeesSalariesTotal(final List<Employee> employeeList) { BigDecimal employeeSalaries = new BigDecimal(0); // 1- Filter out non-active {@link Employee} objects List<Employee> activeEmployees = employeeList.stream().filter(e -> e.active) .collect(Collectors.toList()); // 2- Filter out {@link Employee} objects with invalid codes for (Employee employee : activeEmployees) { List<Long> employeeInvalidCodes = employeeRepository.findInvalidCodesByEmployeeId(employee.id); if (employeeInvalidCodes.contains(employee.code)) { activeEmployees.remove(employee); continue; } employeeSalaries = employeeSalaries.add(employee.salary); } return employeeSalaries; }
My best attempt was that the employee entity might already have the list of invalid codes, that way we wouldn't need to look for them by calling employeeRepository.findInvalidCodesByEmployeeId(employee.id)
. My other take was that we could cache all those invalid codes per employee. Apparently that was not a good answer.
What's the correct approach to refactor this code from a performance point of view?
Solution
Why can employee objects with invalid codes even exist? The problem with invalid state is that you have to find some default behaviour (in this case, count their salary as zero), and this is both tiresome / arbitrary, and a lot of legwork. It is far better to set it up such that it is simply impossible for an instance of
Employee
to exist that is invalid; for example, use an enum for whatever 'code' might be that only contains valid codes. If you can't have that, then do the next best thing: Do not ever let invalid-state objects 'infect' the rest of a code base (i.e. it should never even make it to this method), and then, when encountering them, throw an exception.BigDecimal
is not generally a good idea for salaris; instead, uselong
, and represent the atomic unit of finance for the target currency. i.e. someone who earns $2000 a month should get a long with200000
in it (that many cents, that is your salary). The problem with BD is that it is incredibly complicated and doesn't actually solve anything. For example, if someone takes a salary 'pay cut' and loses a third, then BD will just throw an exception (you can't divide by 3 without losing precision). You can solve that for example by giving BD some arbitrary precision but now we're right back on lack of precision thatdouble
suffers from, so it's usually better to explicitly encode rounding behaviour each and every time you need to do division. (i.e. it is not possible to write code that applies a 1/3 paycut. It is only possible to write code that applies a 1/3 paycut and then rounds down to the nearest cent, for example). Once you do anything 'interesting' with your BigDecimal, such as actually transfer the salary to the employee, the bank doesn't allow you to transfer fractional cents. If you prefer, use an object that represents a financial amount (it should pack in both the amount of atomary units, e.g. dollarcents, bitcoin satoshis, etc, as well as the currency type). There are libraries for this, such as joda-money.You are mixing stream based logic and for loops, and the end result is overly long, confusion, and inefficient as a result. Pick a side. Either [A] for loop and do all the work inside the loop (instead of: "Loop, make an entirely new list containing just active employees, THEN, take that list, loop, etc" - just loop ones), or [B] a stream, and do all the work in a single stream op. So,
.filter()
to restrict to active, then do NOT invokecollect(Collectors.toList())
, instead, keep going. map it to an object, and end it all with.sum()
.Your code as is will calculate invalid codes for every employee, which sounds quite inefficient. It's not relevant once you eliminate the concept of long-lived invalid-state, though.
BigDecimal.ZERO
is the idiomatic way to represent zero. Not relevant once you refactor tolong
or joda-money though.
What's the correct approach to refactor this code from a performance point of view?
That's.. silly. The objectively correct answer is that this is the wrong question to ask and anybody on a team that wastes more than 5 seconds on this should be sent back to the junior position. No company on the planet has more than a few tens of thousands of employees, so any attempt to 'improve performance' here will accomplish absolutely nothing whatsoever. For such tiny (a few tens of thousands, that's tiny, in computer terms) collections, performance doesn't work like that. CPU caching and hotspot vagary is the controlling factor. Even the JVM engineers that work on this stuff won't be able to just predict what code would theoretically run faster (and we're talking about reducing the runtime from 1 nanosecond to 0.999 nanosecond - utterly irrelevant in any case), it's crazy to expect a non-JVM-engineer to know or apply it. In any case, whatever the answer might be, run it on a different OS or a different VM or a different version and the answer will be different, further highlighting how it's beyond nuts to worry about it here.
The right move is to refactor such that the code is easy to read, easy to debug, easy to test, and is idiomatic (looks like code your average java programmer would write). That is why you should not re-calculate invalid codes every time, and that is why you should eliminate the possibility of long-lived invalid state. Not because 'it makes the code run faster' - but because it makes the code easier to follow, easier to test, easier to debug, and makes it look more like standard java. You want idiomatic code not just because that means your code is most easily read by others, but also because JVM engineers, when making the JVM more efficient, cater primarily to 'common patterns'. Hence, writing idiomatic code gets you performance benefits. Let's call this the "When in Rome, be like the Romans" rule.
Presumably you got asked this at a job interview and didn't get the job. There are three plausible options here:
[1] You misunderstood what the interviewer was asking you.
[2] The interviewer is a misguided, and probably somewhat arrogant, engineer who thinks there are correct answers to these questions. They are not aware of their own failings, for example, they do not realize that hardware and JVM impl controls here, performance-wise, and not the algorithm. Your best bet is to either become so experienced that you know the wrong answer they are looking for and give that, or, better yet, move on. You do not want to work here.
[3] The interviewer is not. An answer like the above ('but.. why are you asking me for the 'better performing code'? Assuming employee means what I think it means, there can't realistically be more than a few ten thousand so any time available to optimize this code should clearly not be spent here!') -- that would land you the job. A good interviewer doesn't ask questions like it's a school test, where there is one correct answer. Instead, they ask questions that draw out an interesting response; they ask questions that gives you the opportunity to talk about something you know. They are checking that you can think out of the box, have a sense of perspective, know a little more than the bare minimum job requirements, and are willing to speak your mind and can do so in a professional and productive manner. If you prefer, it was a 'trick' question and the real test is if you realized that the question was inappropriate for this code.
Answered By - rzwitserloot