Clothing accounts for approximately 5% of the UK CPI basket and is currently covered by manually collected price data. We obtain web-scraped data from the online shopping websites of the main retailers in the clothing sector. Because web scraping collects far more clothing items than manual price collection, we aim to use it to increase product coverage. Collecting prices daily makes the price data more representative, and covering a wider variety of clothing types improves the granularity of the index.
We process web-scraped textual data using Natural Language Processing (NLP) and machine learning techniques to build a clothing price index. This presentation outlines three of the key pipelines we use to build the index: 1. Clothing Classification, 2. Product Grouping, 3. Index Run.
First, the classification pipeline trains a supervised machine learning model to produce a classification mapper, which maps individual clothing products to narrowly defined clothing consumption segments such as “women’s dresses” or “men’s jeans”. Secondly, we create a product grouping mapper, which maps each product to a product group using a rules-based method. Grouping is crucial for the clothing price index because the market has high product turnover and strong seasonality. Thirdly, we calculate a clothing price index using multilateral index methods, which make better use of the dynamic structure of web-scraped data, where products continually enter and leave the market.
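To illustrate the third step, the sketch below shows one possible multilateral method, GEKS-Jevons, applied to a toy set of product-group prices. The choice of GEKS-Jevons, the function names, and the example data are assumptions made for illustration only; they are not necessarily the method or the data used in the project. The index is unweighted, on the assumption that web-scraped data contain prices but no quantities.

```python
from math import exp, log


def jevons(prices_a, prices_b):
    """Bilateral Jevons index from period a to period b.

    Uses only the product groups priced in both periods (matched sample).
    """
    matched = prices_a.keys() & prices_b.keys()
    if not matched:
        raise ValueError("no matched product groups between the two periods")
    log_relatives = [log(prices_b[k] / prices_a[k]) for k in matched]
    return exp(sum(log_relatives) / len(log_relatives))


def geks_jevons(prices_by_period):
    """GEKS-Jevons index series, with the first period as the base (1.0).

    For each period t, takes the geometric mean, over every link period l
    in the window, of the chained comparison base -> l -> t.
    """
    periods = list(prices_by_period)
    base = prices_by_period[periods[0]]
    series = {}
    for t in periods:
        current = prices_by_period[t]
        log_links = [
            log(jevons(base, link) * jevons(link, current))
            for link in prices_by_period.values()
        ]
        series[t] = exp(sum(log_links) / len(log_links))
    return series


# Hypothetical product-group prices: "coat" enters in the second period
# and "dress" leaves before the third.
example_prices = {
    "2024-01": {"dress": 20.0, "jeans": 40.0},
    "2024-02": {"dress": 22.0, "jeans": 40.0, "coat": 80.0},
    "2024-03": {"jeans": 44.0, "coat": 88.0},
}

if __name__ == "__main__":
    for period, value in geks_jevons(example_prices).items():
        print(period, round(value, 4))
```

Because every pair of periods in the window contributes to the result, a product group that is present in only some periods still influences the index, which is the property that makes multilateral methods attractive for data with high product turnover.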
In conclusion, this project will allow us to modernise UK consumer price statistics by making better use of new data sources and innovative methods.