The Office for National Statistics (ONS) has encountered many challenges while experimenting with web-scraped price data for price index generation. These include the size of the data, the frequency at which the data arrive, and data quality problems caused by the quality of the web scrapers, to name just a few.

In this talk, we discuss our most recent work on tackling the challenge posed by the quality of the web scrapers. As an illustrative example, we use US Amazon web-scraped ‘electronics’ product data to present a set of cluster analyses based on the product name information only. The data set was obtained from the Billion Prices Project website.

We discuss the following issues within our proposed hierarchical divisive clustering framework:

  • How do we decide on the total number of clusters?
  • Which cluster should be split further?
  • How should a cluster be split?
  • When should the algorithm terminate?
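
To make the four questions above concrete, here is a minimal sketch of a divisive (top-down) clustering loop over product names. It is an illustration under stated assumptions, not the framework presented in the talk: the Jaccard distance on word tokens, the rule of splitting the cluster with the largest diameter, the split around the two most dissimilar names, and the `max_clusters` and `max_diameter` parameters are all hypothetical choices, and the product names are invented.

```python
import re
from itertools import combinations


def tokens(name):
    """Lower-case word tokens of a product name (assumed text representation)."""
    return frozenset(re.findall(r"[a-z0-9]+", name.lower()))


def distance(a, b):
    """Jaccard distance between the token sets of two names (assumed metric)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)


def farthest_pair(cluster):
    """Return (diameter, a, b) for the most dissimilar pair in a cluster.

    Returns (0.0, None, None) when no pair has a positive distance.
    """
    best = (0.0, None, None)
    for a, b in combinations(cluster, 2):
        d = distance(a, b)
        if d > best[0]:
            best = (d, a, b)
    return best


def split(cluster, seed_a, seed_b):
    """How to split: assign each name to the nearer of two seed names."""
    left, right = [], []
    for name in cluster:
        if distance(name, seed_a) <= distance(name, seed_b):
            left.append(name)
        else:
            right.append(name)
    return left, right


def divisive_clustering(names, max_clusters=10, max_diameter=0.8):
    """Start from one cluster and split repeatedly.

    The number of clusters is not fixed in advance: it is bounded by
    max_clusters and otherwise determined by the stopping rule.
    """
    clusters = [list(names)]
    while len(clusters) < max_clusters:
        # Which cluster to split: the one with the largest diameter.
        scored = [(farthest_pair(c), i) for i, c in enumerate(clusters)]
        (diameter, seed_a, seed_b), idx = max(scored, key=lambda s: s[0][0])
        # When to terminate: every cluster is already tight enough.
        if seed_a is None or diameter <= max_diameter:
            break
        left, right = split(clusters[idx], seed_a, seed_b)
        clusters[idx:idx + 1] = [left, right]
    return clusters


# Invented product names, for illustration only.
names = [
    "Sony 40 inch LED TV",
    "Samsung 55 inch LED TV",
    "Canon EOS digital camera",
    "Nikon digital camera kit",
]
clusters = divisive_clustering(names)
for cluster in clusters:
    print(cluster)
```

With these four names the loop performs a single split and then stops, because the most dissimilar pair within each resulting cluster is below the assumed `max_diameter` of 0.8. Lowering `max_diameter` or changing the distance function changes both the number of clusters and their membership, which is why the four design questions have to be answered together.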

