A universal Wi-Fi fingerprint localization method based on machine learning and sample differences

Wi-Fi technology has become an important candidate for localization because of its low cost and because it needs no additional installation. Wi-Fi fingerprint-based positioning is widely used because of its readily available hardware and acceptable accuracy, especially with the current fingerprint localization algorithms based on Machine Learning (ML) and Deep Learning (DL). However, there exist two challenges. Firstly, the traditional ML methods train a specific classification model for each scene; therefore, the models are hard to deploy and manage on the cloud. Secondly, it is difficult to train an effective multi-classification model with a small number of fingerprint samples. To solve these two problems, a novel binary classification model based on the samples’ differences is proposed in this paper. We divide the raw fingerprint pairs into positive and negative samples based on each pair’s distance. New relative features (e.g., sort features) are introduced to replace the traditional pair features, which use the Media Access Control (MAC) address and Received Signal Strength (RSS). Finally, the boosting algorithm is used to train the classification model. The UJIndoorLoc dataset, which includes data from three different buildings, is used to evaluate the proposed method. The preliminary results show that the floor detection success rate of the proposed method can reach 99.54% (eXtreme Gradient Boosting, XGBoost) and 99.22% (Gradient Boosting Decision Tree, GBDT), and the positioning error can reach 3.460 m (XGBoost) and 4.022 m (GBDT). Another important advantage of the proposed algorithm is that a model trained with one building’s data can be applied well to another building, which shows strong generalization ability.


Introduction
The outdoor location service has increasingly matured with the rapid development of the Global Navigation Satellite System (GNSS). However, GNSS fails to provide indoor positioning service because its signals are obstructed and attenuated indoors. Meanwhile, indoor positioning has become more and more important in people's daily activities, such as shopping, parking, and health monitoring. Accordingly, many scholars have conducted considerable research on indoor positioning with various techniques, such as Wi-Fi, Bluetooth, geomagnetic localization, Radio Frequency Identification (RFID), ultra-wideband, wireless local area network, computer vision, visible light communication, and Pedestrian Dead Reckoning (PDR) assisted by accelerometer and gyroscope (He & Chan, 2016; Naser and Li, 2021; Zhuang et al., 2018; Yang et al., 2015; El-Sheimy & Li, 2021; El-Sheimy & Youssef, 2020).
Among these techniques, Wi-Fi positioning has become a research hotspot due to its mature hardware and software ecology, low cost, and no need of extra deployment. The main Wi-Fi positioning algorithms include Access Point (AP) proximity-aware positioning (Hodes et al., 1997), fingerprint-based positioning (Zhuang et al., 2016), and trilateration localization based on the signal propagation (Bahl & Padmanabhan, 2000). But the fingerprinting algorithm is more widely used because it can achieve the highest positioning accuracy.

Open Access
Satellite Navigation
Currently, the neighbor point mismatch is a prime problem in Wi-Fi fingerprint-based positioning. The traditional solution calculates the similarity between the fingerprint RSS vector and the observation RSS vector using different indices, like the Euclidean distance (Kaemarungsi & Krishnamurthy, 2004), cosine similarity (Han et al., 2015), Pearson coefficient (Li et al., 2019), and others (Machaj et al., 2011). Most of these methods compute the difference between the RSS vectors directly. However, it is difficult to describe the complex nonlinear relationship between signal vectors accurately in this way. Therefore, many scholars have recently used Machine Learning (ML) and Deep Learning (DL) for neighbor point matching. These approaches can be broadly divided into two groups. One is the supervised learning methods, which use various classification methods, like Random Forest (RF) (Lee et al., 2019), Decision Tree (DT) (Chanama & Wongwirat, 2018), Bayes (Chen et al., 2013), Support Vector Machine (SVM), Neural Network (NN) (Zhang et al., 2013; Esmond & Bernard, 2013), Convolutional Neural Network (CNN) (Shao et al., 2018) and other classification algorithms (Feng et al., 2014; Li et al., 2021). The other is unsupervised learning using clustering methods, such as K-Means (Chen et al., 2015), fuzzy clustering (Bi et al., 2018), and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (Deng et al., 2018).
These two groups of methods have their obvious weaknesses and strengths. The classification algorithms place high demands on both sample quality and sample quantity. Considering the time and labor costs, classification, especially multi-classification, may not be suitable for fingerprint-based positioning, since the classifier training requires a large number of samples for each category. Therefore, many methods for sample enhancement have been proposed, for example, crowdsourced data collection (Guo & Pun, 2019), interpolation methods for sample creation (Kolakowski, 2020), and DL methods that increase the number of samples, of which the most common is the Generative Adversarial Network (GAN) (Liu & Wang, 2020; Zou et al., 2020). But the data generated by GAN has poor quality, and the generation model is hard to converge.
The clustering algorithm also has some problems. Firstly, the computational complexity of clustering is too high for real-time positioning. Secondly, clustering is more applicable to zone localization, and the accuracy of point localization using this algorithm is always low. In addition, most clustering algorithms require a known number of classes and some initial cluster centers, which makes them hard to use in practice. Abnormal data also has a greater effect on the final result of clustering than on that of other methods.
The above ML-based or DL-based methods all face the same problem. They use the APs' RSS values as the input features, but the RSS values have a strong relationship with the location of the fingerprint. The classifier trained with the fingerprint data of one building cannot be used in another building, and sometimes not even on another floor. Each building or floor must therefore train and manage its own classifier, which causes some problems. The first and foremost problem is the model deployment on the servers for practical application. It is necessary to deploy a huge number of models and update them periodically, which is costly. Another problem is model management when there are many models on the cloud. They are hard to maintain effectively, and the operation of the whole cloud platform requires considerable resources.
To solve the above problems, a novel method is proposed using the differences among the samples. To make full use of the differences, we adopt relative features, like the repeated AP, the signal similarity, and other features, rather than the commonly used absolute features. The boosting algorithms eXtreme Gradient Boosting (XGBoost) and Gradient Boosting Decision Tree (GBDT) are used in this paper to train a binary classification model rather than a multi-classification model, because they are widely used in binary classification and their performance is much better than that of other algorithms. The classifier performs well on the test datasets whether it is trained with the same building's data or with another building's data. Figure 1 shows the proposed positioning system, which involves two main phases, i.e., offline and online.

System components
In the offline phase, the main work is the fingerprint collection and the classification model training. To improve the quality of the samples, it is better to collect RSS values several times at each fingerprint point. After the RSS values are collected, all the samples are paired. If the two samples of one pair come from the same point or neighbor points, the pair is regarded as a positive pair, like FP A1 and FP A2, which are both collected at Point A in Fig. 1. If they come from different points and the distance between them is large enough, they form a negative pair, like FP A1, which is collected at Point A, and FP B1, which is collected at Point B in Fig. 1. We choose the FPs that come from different fingerprint points as a negative pair in this paper. Then, new features which represent the difference between the two samples in each pair are calculated. The newly extracted features, rather than the original MAC-RSS pairs, are the inputs of the classifier. The boosting algorithm is used for classifier training. Several base classifiers are used in the training process, and the output of the previous classifier works as the input of the next classifier. The misclassified samples are given more weight in the next classification. A binary classification model is the output at the end. The output of the model is the probability that the observation and the fingerprint come from the same point or neighbor points.
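The pairing step can be illustrated with a minimal Python sketch. It assumes each sample carries the planar coordinates of its fingerprint point; the function name `label_pairs` and the thresholds `neighbor_dist` and `far_dist` are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def label_pairs(samples, neighbor_dist=1.0, far_dist=5.0):
    """Label every pair of samples as positive (1) or negative (0).

    samples: list of (sample_id, (x, y)) tuples, where (x, y) is the
    position of the fingerprint point in metres.
    neighbor_dist and far_dist are hypothetical thresholds chosen for
    illustration only.
    """
    labelled = []
    for (id_a, pos_a), (id_b, pos_b) in combinations(samples, 2):
        d = math.dist(pos_a, pos_b)
        if d <= neighbor_dist:
            # Same point or neighbor points: positive pair.
            labelled.append((id_a, id_b, 1))
        elif d >= far_dist:
            # Different points that are far enough apart: negative pair.
            labelled.append((id_a, id_b, 0))
        # Pairs in between are ambiguous and are skipped in this sketch.
    return labelled


# Two scans at Point A and one scan at Point B, 10 m away.
samples = [("A1", (0.0, 0.0)), ("A2", (0.0, 0.0)), ("B1", (10.0, 0.0))]
pairs = label_pairs(samples)
```

In this toy input, FP A1 and FP A2 form the only positive pair, and each of them forms a negative pair with FP B1.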
In the positioning phase, the key fingerprint points which share MAC addresses with the observation list are selected for the next calculation. Then, the features are calculated from these fingerprints and the observation data and are input into the model trained in the offline phase. The model outputs the probability of each fingerprint point and the attribute (neighbor point or not) of each fingerprint point. Finally, the point with the highest probability is the final localization result.
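The online steps can be sketched as follows. This is only an illustration under stated assumptions: scans are dictionaries mapping MAC to RSS, and `overlap_score` is a hypothetical stand-in for the trained binary classifier, not the model used in the paper.

```python
def locate(observation, fingerprints, score_pair):
    """Return (point_id, probability) of the best-matching fingerprint.

    observation: dict mapping MAC -> RSS for the online scan.
    fingerprints: list of (point_id, dict MAC -> RSS).
    score_pair: callable giving the probability that two scans come
    from the same or neighbor points (stand-in for the trained model).
    """
    best_point, best_prob = None, -1.0
    for point_id, scan in fingerprints:
        if not observation.keys() & scan.keys():
            continue  # no common MAC: not a candidate point
        prob = score_pair(observation, scan)
        if prob > best_prob:
            best_point, best_prob = point_id, prob
    return best_point, best_prob


def overlap_score(scan_a, scan_b):
    """Hypothetical scorer: share of MACs that both scans contain."""
    return len(scan_a.keys() & scan_b.keys()) / len(scan_a.keys() | scan_b.keys())


fingerprints = [
    ("P1", {"aa": -50, "bb": -60}),
    ("P2", {"bb": -55, "cc": -70, "dd": -80}),
    ("P3", {"ee": -40}),
]
obs = {"aa": -52, "bb": -61}
result = locate(obs, fingerprints, overlap_score)
```

Point P3 is never scored because it shares no MAC with the observation, which mirrors the candidate selection described above.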

Feature selection
The traditional fingerprinting methods always use the MAC-RSS pairs as the features. This means the features have a strong relationship with the locations of the fingerprint points, which limits the use of the multiple classifiers. We propose a method that uses positive and negative pairs for classification. If the two samples of one pair come from the same point or neighbor points, the pair is regarded as a positive pair. If they come from different points and their spacing is large enough, they form a negative pair, as shown in Fig. 2. Then, the features which represent the difference between the two samples in each pair are calculated.
The new features are divided into four types, i.e., the coincidence number features, the sort feature, the similarity feature, and the transposition feature. All these features are listed in Fig. 3. Figure 4 and the following equations show how to calculate these features.
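The similarity features can be calculated by the following equations:
a. Euclidean Distance:
$$ d_{E} = \sqrt{\sum_{i=1}^{m} \left( rss(ap_i\ \mathrm{in}\ SA) - rss(ap_i\ \mathrm{in}\ SB) \right)^{2}} $$
where $rss(ap_i\ \mathrm{in}\ SA)$ and $rss(ap_i\ \mathrm{in}\ SB)$ represent the RSS value of $ap_i$ in scanlists SA and SB, respectively, and $m$ is the number of repeated MACs in scanlists SA and SB.
b. Cosine Similarity:
$$ \cos(SA, SB) = \frac{\sum_{i=1}^{m} rss(ap_i\ \mathrm{in}\ SA)\, rss(ap_i\ \mathrm{in}\ SB)}{\sqrt{\sum_{i=1}^{m} rss(ap_i\ \mathrm{in}\ SA)^{2}}\ \sqrt{\sum_{i=1}^{m} rss(ap_i\ \mathrm{in}\ SB)^{2}}} $$
c. A further similarity measure, in which Max(·) denotes the maximum function.
d. Pearson's Coefficient:
$$ \rho(SA, SB) = \frac{\sum_{i=1}^{m} \left( rss(ap_i\ \mathrm{in}\ SA) - \overline{rss}_{SA} \right)\left( rss(ap_i\ \mathrm{in}\ SB) - \overline{rss}_{SB} \right)}{\sqrt{\sum_{i=1}^{m} \left( rss(ap_i\ \mathrm{in}\ SA) - \overline{rss}_{SA} \right)^{2}}\ \sqrt{\sum_{i=1}^{m} \left( rss(ap_i\ \mathrm{in}\ SB) - \overline{rss}_{SB} \right)^{2}}} $$
where $\overline{rss}_{SA}$ and $\overline{rss}_{SB}$ are the mean RSS values of the $m$ repeated APs in SA and SB, respectively.
In the remaining feature definitions, SA → SB represents the shift operation from SA to SB, W(·) is the weight function, Min(·) is the minimum function, and i, j represent one AP's rank (sort) in AP scanlists SA and SB, respectively.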
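Three of the similarity measures named above (Euclidean distance, cosine similarity, and Pearson's coefficient) can be sketched in Python over the MACs that two scanlists share. The sketch uses the standard textbook definitions, which is an assumption about the exact form used in the paper; the function name `similarity_features` and the zero fallbacks for degenerate inputs are also illustrative choices.

```python
import math


def similarity_features(scan_a, scan_b):
    """Similarity features over the MACs common to two scans.

    scan_a, scan_b: dicts mapping MAC -> RSS (dBm).
    Uses the standard definitions of each measure (an assumption).
    """
    common = sorted(scan_a.keys() & scan_b.keys())
    m = len(common)  # number of repeated MACs
    if m == 0:
        return {"common": 0, "euclidean": 0.0, "cosine": 0.0, "pearson": 0.0}
    a = [scan_a[mac] for mac in common]
    b = [scan_b[mac] for mac in common]

    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    dot = sum(x * y for x, y in zip(a, b))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # Pearson is undefined for constant vectors; fall back to 0.0 here.
    pearson = cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0

    return {"common": m, "euclidean": euclidean, "cosine": cosine, "pearson": pearson}


scan_a = {"aa": -50, "bb": -60, "cc": -70}
scan_b = {"aa": -53, "bb": -64, "cc": -70, "dd": -80}
feats = similarity_features(scan_a, scan_b)
```

Only the three MACs present in both scans contribute; the AP "dd" seen in one scan alone is ignored by these measures.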

Removal of unreliable APs
There are many mobile phones or other mobile devices in the building, and they may have a great influence on fingerprint collection and online positioning. In addition, the proposed method requires the ordering information. It is therefore necessary to delete these mobile APs first. The mobile MACs in the fingerprint data are easy to find and delete by statistical means, like the number of repeated occurrences, the coverage area, the RSS values, and so on.
It is hard to delete the mobile MACs in online positioning due to the limited information in the observation. There are two general solutions. The first one is to compare the current observation with the historical observations, so that some abnormal MACs can be found. The second solution requires the system to maintain an abnormal field list to detect abnormal MACs in real time. The abnormal field list may contain some obvious abnormal fields, like 'Mobile', 'HUAWEI', 'OPPO', 'VIVO', 'XIAOMI', 'smartphone', and so on.
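The abnormal field list can be sketched as a simple keyword filter. The sketch assumes that each observed AP carries a network name (SSID) alongside its MAC and that the listed fields are matched against that name without regard to case; both points are assumptions, and `drop_mobile_aps` is a hypothetical helper.

```python
# Abnormal fields taken from the examples in the text, lower-cased.
ABNORMAL_FIELDS = ("mobile", "huawei", "oppo", "vivo", "xiaomi", "smartphone")


def drop_mobile_aps(scan):
    """Remove APs whose name contains an abnormal field.

    scan: list of (mac, ssid, rss) tuples. Matching on the SSID is an
    assumption made for this sketch.
    """
    return [
        (mac, ssid, rss)
        for mac, ssid, rss in scan
        if not any(field in ssid.lower() for field in ABNORMAL_FIELDS)
    ]


scan = [
    ("00:11:22:33:44:55", "Campus-WiFi", -48),
    ("66:77:88:99:aa:bb", "HUAWEI P30", -60),
    ("cc:dd:ee:ff:00:11", "my smartphone hotspot", -72),
]
kept = drop_mobile_aps(scan)
```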

RSS normalization
The RSS received by different types of devices differs because of device heterogeneity. Thus, we map the values to a uniform range using Eq. (13) (Song and Wang 2017), which can mitigate the impact of device heterogeneity to some extent:
$$ rss_{norm} = \frac{rss - \overline{rss}}{rss_{std}} \qquad (13) $$
where $\overline{rss}$ and $rss_{std}$ respectively represent the mean value and the standard deviation of the RSS list.
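A minimal sketch of such a normalization is given below, assuming it is a standard z-score built from the mean and the standard deviation of the RSS list, as the description suggests. The use of the population standard deviation and the handling of a constant list are assumptions of this sketch.

```python
from statistics import mean, pstdev


def normalize_rss(rss_list):
    """Z-score normalization of one RSS list.

    Assumes the population standard deviation; a constant list is
    mapped to zeros to avoid dividing by zero.
    """
    mu = mean(rss_list)
    sigma = pstdev(rss_list)
    if sigma == 0:
        return [0.0 for _ in rss_list]
    return [(rss - mu) / sigma for rss in rss_list]


# Two devices that see the same APs with a constant 5 dB offset.
device_1 = normalize_rss([-50, -60, -70])
device_2 = normalize_rss([-55, -65, -75])
```

A constant offset between two devices disappears after this mapping, which is one way the device heterogeneity is reduced.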

Feature enhancement
The total number of the new features is 23, which is limited for model training. In addition to the whole RSS list, we also use the top three list and the top five list to calculate these features, which increases the total number of features to 57. In the real test, the added features improve the accuracy. However, some features are strongly correlated; therefore, some of the less useful features should be dropped in the next experiment.
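The enhancement can be sketched as computing the same feature function on the whole list, the top three list, and the top five list. The sketch assumes that "top k" means the k strongest APs of each scan; `top_k`, `enhanced_features`, and the toy feature `common_count` are hypothetical names introduced for illustration.

```python
def top_k(scan, k):
    """Keep the k strongest APs of a scan (dict MAC -> RSS)."""
    strongest = sorted(scan.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(strongest)


def enhanced_features(scan_a, scan_b, feature_fn, ks=(3, 5)):
    """Compute feature_fn on the full lists and on each top-k list."""
    features = {f"all_{name}": v for name, v in feature_fn(scan_a, scan_b).items()}
    for k in ks:
        sub = feature_fn(top_k(scan_a, k), top_k(scan_b, k))
        features.update({f"top{k}_{name}": v for name, v in sub.items()})
    return features


def common_count(scan_a, scan_b):
    """Toy feature: number of MACs present in both scans."""
    return {"common": len(scan_a.keys() & scan_b.keys())}


scan_a = {"aa": -40, "bb": -50, "cc": -60, "dd": -70, "ee": -80, "ff": -90}
scan_b = {"aa": -45, "cc": -55, "ff": -60, "gg": -65, "bb": -85, "dd": -88}
feats = enhanced_features(scan_a, scan_b, common_count)
```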

Data declaration
The UJIndoorLoc Dataset (Torres-Sospedra et al. 2014) collection covers an area of 108,703 m², including three buildings with 3-5 floors. The Wi-Fi data were collected by more than 20 collectors using 25 different types of smartphones, and the total number of APs is more than 500. Table 1 lists the detailed data information for each building, which indicates that the diversity and complexity of the dataset are high. The distribution of the fingerprint points for each building is visualized in Fig. 5, and Table 2 gives the number of MAC repetitions between the buildings.

Experiments setting
The fingerprint data used for the classifier training is from one building, but the validation data may come from different buildings to test the adaptability of different classifiers. Table 3 shows the setting of the experiment with M representing the Model and T the Test.
For example, M0-T1 means the data for model training come from Building0 and the validation data come from Building1.

Experiment results
To validate the effectiveness of the proposed method, its results are compared with those of other methods in the relevant articles, and several comparative experiments are conducted. We test the performance of the Nearest Neighbor (NN), DT, and Ensemble Learning (EL) algorithms, including the bagging algorithms (Bagging and RF) and the boosting algorithms (XGBoost and GBDT). To test the feasibility of the proposed method

Compared with existing methods
Since the UJIndoorLoc Dataset is widely used to test the performance of fingerprint positioning algorithms, the result of the proposed method is compared with the results of the other approaches listed in Table 4. We can see from the table that the proposed method performs better not only in the positioning error but also in the building and floor judgment accuracy. The first proposed method uses XGBoost and achieves a mean positioning error of 3.42 m and a floor judgment accuracy of up to 99.40%. The second proposed method uses GBDT and achieves a mean positioning error of 2.45 m and a floor judgment accuracy of 99.14%.

Comparison with popular algorithms
There are three main evaluation indices: floor detection accuracy, point matching accuracy, and mean positioning error. The point matching accuracy represents the rate at which the final result matches the chosen testing fingerprint point.
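The three indices can be computed as in the following sketch. It assumes each result is a (floor, point id, planar position) triple and that the mean positioning error is the mean Euclidean distance over all test samples; whether the paper restricts the error to samples with a correct floor is not stated here, so that choice is an assumption.

```python
import math


def evaluate(predictions, truths):
    """Return (floor accuracy, point matching accuracy, mean error in m).

    predictions, truths: equally long lists of (floor, point_id, (x, y)).
    The error is averaged over all samples (an assumption).
    """
    n = len(truths)
    floor_hits = sum(p[0] == t[0] for p, t in zip(predictions, truths))
    point_hits = sum(p[1] == t[1] for p, t in zip(predictions, truths))
    total_err = sum(math.dist(p[2], t[2]) for p, t in zip(predictions, truths))
    return floor_hits / n, point_hits / n, total_err / n


predictions = [(1, "P1", (0, 0)), (1, "P3", (3, 4)), (2, "P5", (6, 8))]
truths = [(1, "P1", (0, 0)), (1, "P2", (0, 0)), (1, "P4", (0, 0))]
floor_acc, point_acc, mean_err = evaluate(predictions, truths)
```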
The NN is one of the most popular methods in fingerprint positioning. We transform the multi-classification into binary classification in this paper, and previous works show that tree models perform well in binary classification. Thus, the DT and its enhanced algorithm, EL, including the bagging algorithm and the boosting algorithm, are chosen to evaluate the performance of the proposed method. The bagging algorithm includes Bagging and RF, while the boosting algorithm includes XGBoost and GBDT. The detailed results are shown in Table 5. The performance of the single tree is much poorer than that of the EL, and the performance of the bagging algorithm is poorer than that of the boosting algorithm. As shown in Table 5, the positioning accuracy is slightly higher when using the GBDT, whereas XGBoost performs better in floor detection. Figure 6 shows the CDFs of the different algorithms.

Classifiers trained by different buildings
The main objective of our method is to improve the adaptability of the classifier, making the classifier trained with one building's data usable in another building. The test data from the three buildings are used to test the validity of the proposed method. The success rates of the floor judgment and point judgment, and the mean positioning error, are recorded in the following tables. The performance of the DT, bagging algorithm, and boosting algorithm is tested, respectively. The boosting algorithm performs better than the other methods, and the proposed method performs well even if the data come from different buildings. Tables 6, 7, 8 show the results of each building, and Figs. 7, 8, 9 show the CDFs of each method.

Feature dimension reduction
Initially, 23 features are chosen to test the proposed method, and Table 9 shows the result of the classifier trained with these 23 features. To enrich the feature set, we increase the number of features to 57; Table 6 shows the corresponding performance. When XGBoost and GBDT are used, the score of each feature can be output after training, and Figs. 10, 11, 12 show the score of each feature in the different buildings. To try to improve the accuracy of each classifier, the low-score features are candidates for removal. However, the low-score features differ between classifiers, so only the common low-score features are deleted. The remaining features are used to train a new classifier, whose performance is also tested. Table 10 shows the performance after deleting these features. We find no obvious improvement after deleting the low-score features.
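Selecting the common low-score features can be sketched as a set intersection over the per-classifier scores. The feature names, score values, and threshold below are hypothetical; in practice the scores could be read from a trained model, for example through the `feature_importances_` attribute that scikit-learn-style estimators expose.

```python
def common_low_score_features(scores_by_model, threshold):
    """Features whose score is below threshold in every classifier.

    scores_by_model: dict mapping model name -> dict of feature -> score.
    """
    low_sets = [
        {name for name, score in scores.items() if score < threshold}
        for scores in scores_by_model.values()
    ]
    return set.intersection(*low_sets) if low_sets else set()


# Hypothetical scores for classifiers trained on three buildings.
scores_by_model = {
    "M0": {"euclidean": 0.30, "cosine": 0.02, "common": 0.25, "sort": 0.01},
    "M1": {"euclidean": 0.20, "cosine": 0.01, "common": 0.03, "sort": 0.40},
    "M2": {"euclidean": 0.10, "cosine": 0.04, "common": 0.30, "sort": 0.02},
}
to_drop = common_low_score_features(scores_by_model, threshold=0.05)
```

Only a feature that scores low in every classifier is dropped, so a feature that matters for one building is kept for all of them.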

Conclusion and future work
To improve the performance of indoor fingerprint-based positioning, it is a trend to use ML or DL methods. However, current methods that use the MAC-RSS pairs as the features face many problems, like low scene adaptability and limited accuracy of the localization model. Thus, we proposed a novel method to solve these problems. To improve the generalization ability of the model, we divided the samples into positive pairs and negative pairs and calculated relative features rather than absolute features from these pairs. Some methods were used to increase the dimension of the features. Then binary classification was used to replace multi-classification, and the boosting algorithm was used to improve the accuracy of the classification model. The open-source UJIndoorLoc Dataset was used to test the performance of the proposed method. The results show that the proposed method performs better in floor judgment success rate and positioning error when compared with the NN and other binary classification models. Further studies are necessary, including the development of a method to construct more effective positive and negative samples and the employment of DL rather than ML (Figs. 13, 14).
The score of each feature using the data which comes from Building1
Fig. 12 The score of each feature using the data which comes from Building2