Some weeks ago we encountered a technical issue in Commu that’s probably every programmer’s worst nightmare: a bug that appears out of nowhere without code changes, that’s really difficult to reproduce reliably, that comes from a part of the framework you simply have no way of circumventing, and that has no common pattern other than appearing on Android devices. In this blog post I shed light on my process for handling such a bug, discuss the issues with OSS, and explain different debugging mechanisms.
Discovering the issue
We first found out about the issue when some customers told us it would be great if the Commu map opened to their home city instead of Helsinki. Because Helsinki was our default location for cases where location permission is denied or GPS is off, we initially thought that these users simply hadn’t enabled location. However, we soon started getting reports that location was indeed on and permitted, and then we started seeing the problem in our own team as well. It thus became obvious that there was a bug, but nothing had been changed recently and no new releases had been made! All of a sudden we were experiencing big issues in our production app without having done anything that could possibly have caused them.
I started to frantically search for the cause. The nature of the issue, however, meant that most debugging methods were useless.
Version control history
When something suddenly starts failing, one common method is to first check the version control history for changes made between the current version and the last working version. Usually this leads you directly to the root cause, or at least gives you the possibility to roll back the offending changes. But this time the issue appeared out of nowhere, so the version control history was useless.
Debug with breakpoints and console.logs
The most surefire way to find and understand issues is to follow the execution step by step. Unfortunately, to do this you need two things:
- Knowledge about the source of the issue; you can’t add breakpoints to every step of the program
- The ability to reproduce the issue
We had neither. It was obvious that the issue was with location, but we did not know whether it was coming from the component using the location or from the helper used for querying it. However, because the issue started without code changes, it was pretty clear that it was related to how Expo handles location.
The issue was also really difficult to reproduce. Sami had the issue happening constantly and I could not reproduce it at all, even though we were in the same location with the same settings. Later on it started working for Sami and failing for me, until it disappeared again. Another day it didn’t work for me no matter what I did, until an hour later it was working perfectly again.
Search for changes in installed libraries
Another usual suspect is an updated dependency. In our case, however, nothing had been changed or deployed, so this was clearly not the cause.
Searching if other library users have had the same issue
Apart from the standard Stack Overflow search, it’s worth your while to check the library’s GitHub page to see whether other users have reported similar issues. Even if you don’t know that the issue is within the library, chances are that you will find someone who has made the same mistake you did.
This revealed that other users had started to report this issue en masse. A two-year-old closed issue had racked up 17 comments in three days.
The good thing was that we were no longer alone with the issue. From the user comments we also confirmed the conflicting calls: Location.getCurrentPositionAsync would randomly throw an error saying that the location provider was unavailable even though it was available, while Location.hasServicesEnabledAsync would return true. The bad thing was that we were now stuck waiting for the Expo maintainers to fix the issue.
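The contradiction between the two calls can be checked with a small diagnostic. The following is only a minimal sketch: the LocationApi interface is a trimmed stand-in for the two expo-location calls named above (the real calls accept options that are left out here), and diagnoseLocation and the Diagnosis labels are hypothetical names invented for this illustration.

```typescript
// Trimmed stand-in for the two expo-location calls discussed in the text.
// It is injected so that the sketch can run without a device (assumption).
interface LocationApi {
  hasServicesEnabledAsync(): Promise<boolean>;
  getCurrentPositionAsync(): Promise<{
    coords: { latitude: number; longitude: number };
  }>;
}

// Hypothetical labels for the three situations we care about.
type Diagnosis = "ok" | "services-disabled" | "enabled-but-failing";

// Hypothetical helper: classifies what the location API is reporting.
async function diagnoseLocation(api: LocationApi): Promise<Diagnosis> {
  try {
    await api.getCurrentPositionAsync();
    return "ok";
  } catch {
    // The position call failed; ask whether location services are really off.
    const enabled = await api.hasServicesEnabledAsync();
    // "enabled-but-failing" is the contradictory state described above.
    return enabled ? "enabled-but-failing" : "services-disabled";
  }
}
```

A result of "enabled-but-failing" would match what the GitHub comments described: the position call throws while the services check still reports true.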
Error logging software
Any self-respecting software should have some external error logging system hooked in. It’s a true lifesaver for finding issues. We use Sentry for this. Unfortunately, this error was caught and handled in our code, so it never showed up in Sentry. Once the source of the issue was confirmed, we added Sentry logging to it in order to understand the issue better.
This didn’t really move us forward, as there was no common denominator whatsoever other than the issue appearing on Android devices. It did, however, give us a better understanding of how common the issue was and let us easily confirm whether our workarounds worked.
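Reporting an error that is already handled can be sketched as a thin wrapper around the failing call. This is a minimal illustration, not our production code: withReporting and the Reporter type are hypothetical names, and the report callback is a stand-in for whatever the error logging system offers (with Sentry it could forward to Sentry.captureException, which is an assumption about the wiring).

```typescript
// Stand-in for the error logging system. The context carries whatever extra
// details help to find a pattern, for example platform or OS version.
type Reporter = (error: unknown, context: Record<string, string>) => void;

// Hypothetical wrapper: runs a task, reports a failure, and returns null
// instead of rethrowing, so the caller keeps handling the error as before.
async function withReporting<T>(
  task: () => Promise<T>,
  report: Reporter,
  context: Record<string, string>,
): Promise<T | null> {
  try {
    return await task();
  } catch (error) {
    // The error stays handled, but it is no longer invisible.
    report(error, context);
    return null;
  }
}
```

The point of the design is that the user-facing behaviour stays the same while every failure still becomes a data point in the error log.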
How to deal with unfixable issue in core feature
We had now found the source of the issue. From the GitHub issue discussion we also discovered that some users had managed to fix it by using different location accuracy settings. We tried these, but Sentry quickly confirmed that they were not working for us.
We were now stuck with a broken location system in an app that relies very heavily on location. Expo is a closed system, meaning that you cannot add any custom native code to it yourself. Only the Expo team would be able to fix this issue, and the only thing we could do about it was wait. Not cool at all.
The first thing we wanted to do was to see whether the location library provided any alternative methods for fetching the location. We soon discovered that indeed it did! While fetching the accurate current location was broken, there was a method for fetching the last known location. We hooked this in and eagerly started watching whether the error rates in Sentry would go down.
And they did, by a large margin! The downside of our workaround was that the last known location could be null, or it could be, well, the last known location, meaning that users might in fact get notices for the city they were in yesterday instead of where they are now. However, because location is not absolutely required for Commu, this was still way better than the previous state.
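The workaround boils down to a fallback chain: try the current position, then the last known one, then the default. Below is a minimal sketch under stated assumptions. The LocationSource interface is a trimmed stand-in for the location library (in expo-location the last known position is, to my knowledge, fetched with getLastKnownPositionAsync), resolveLocation is a hypothetical helper, and the Helsinki coordinates are approximate.

```typescript
interface Coords {
  latitude: number;
  longitude: number;
}

interface Position {
  coords: Coords;
}

// Trimmed stand-in for the location library, injected so that the sketch
// can run without a device (assumption, not the full library API).
interface LocationSource {
  getCurrentPositionAsync(): Promise<Position>;
  getLastKnownPositionAsync(): Promise<Position | null>;
}

type Resolved = {
  coords: Coords;
  source: "current" | "lastKnown" | "default";
};

// Approximate coordinates of central Helsinki, the default map location.
const HELSINKI: Coords = { latitude: 60.1699, longitude: 24.9384 };

// Hypothetical helper implementing the fallback chain.
async function resolveLocation(api: LocationSource): Promise<Resolved> {
  try {
    const current = await api.getCurrentPositionAsync();
    return { coords: current.coords, source: "current" };
  } catch {
    // Fetching the current position failed; try the last known one.
  }
  try {
    const last = await api.getLastKnownPositionAsync();
    // The last known position can be null, for example on a fresh device.
    if (last !== null) {
      return { coords: last.coords, source: "lastKnown" };
    }
  } catch {
    // Nothing else to try; fall back to the default.
  }
  return { coords: HELSINKI, source: "default" };
}
```

Returning the source alongside the coordinates lets the UI decide how much to trust the result, for instance by treating a "lastKnown" position as possibly stale.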
To further reduce the pain caused by the issue, we quickly started to redesign the app so that it could be used without location. Luckily for us, Commu is not 100% about location. Location is very important for providing the most relevant information, but the user can still use the app, move the map and search notices without it. We then implemented every change we could come up with so that features would remain usable without location.
Finally, we came up with a backup plan. If the location issue worsened and was not fixed in a sensible time, or if we faced a similar critical issue in the future, we would create a copy of the code (a branch in our case) and then eject the copy from Expo. Most libraries that are as opinionated or locked down as Expo provide ejecting, which means stripping away all the Expo limitations, but also the tools that Expo provides. In our case that would allow us to write a custom implementation for fetching the location without the issues that Expo had. The separate codebase could be used for the lifetime of the issue, and then we could return to the Expo managed version so that ejecting would not be permanent. Obviously, the massive downside would be that all development done in the ejected version would have to be transferred, and possibly even modified, when returning to Expo.
In a nutshell, our action points were:
- Search for and implement the second-best way to do the thing, if possible
- Review which functionalities are affected and whether they can be modified so that they work at least at some level
- Build a backup plan for the absolute worst-case scenario
Expo has now managed to fix the issue, and the fix will be made available in a new SDK. The root cause was in fact an update to Google Play Services, from which Expo was fetching the location in a not-so-elegant way.
The silver lining for us has been that the issue forced us to improve the UX for users who don’t have location enabled or haven’t given the permission. And I got to write this blog post!