
Chatbot LLMs Gatotkaca.AI #Part 1

img of Chatbot LLMs Gatotkaca.AI #Part 1

Introduction

Gatotkaca.AI is my mini-project where I developed a chatbot using LlamaIndex and Mistral models, designed to provide information about weather conditions in Indonesia.

Steps

I’ve broken it down into parts to help you understand how I developed the LLM-powered chatbot.

  1. Web scraping Indonesian province information
  2. ETL implementation (Extract, Transform, Load)
  3. Configuration of LlamaIndex
  4. Finalization and deployment with Streamlit

Tools

In this part, there are several tools you need to install before running the system.

  1. Numpy (pip install numpy)
  2. Pandas (pip install pandas)
  3. BeautifulSoup4 (pip install beautifulsoup4)
  4. JupyterLab (pip install jupyterlab)
  5. Requests (pip install requests)

Actions

  1. First of all, we need to know how many provincial capitals there are in Indonesia. To get that information, we can extract it from Wikipedia. I found it at this link:

https://id.wikipedia.org/wiki/Daftar_ibu_kota_provinsi_di_Indonesia

2. To load the data from the URL and store it in a variable, use the following commands:

   import requests

   url = "https://id.wikipedia.org/wiki/Daftar_ibu_kota_provinsi_di_Indonesia"
   response = requests.get(url)
   wikisite = response.text
   wikisite

3. After receiving the response as HTML, we extract the tbody element with Beautiful Soup to simplify the data breakdown.

   from bs4 import BeautifulSoup
   import re

   soup = BeautifulSoup(wikisite, 'html.parser')
   tbody = soup.find('table').find('tbody')
   tbody

4. There are several columns whose values we need to store in an array and later save to a NumPy file.

   indProv = []

   # Check if tbody exists
   if tbody:
       # Iterate over the rows inside tbody
       for tr in tbody.find_all('tr'):
           # Get the cells inside each row
           td = tr.find_all('td')
           if td and td[0].get_text(strip=True) != '-':
               areas = td[6].get_text(strip=True)
               cleaned_areas = re.sub(r'\[.*?\]', '', areas).strip()

               capital = td[2].get_text(strip=True)
               cleaned_capital = re.sub(r'\[.*?\]', '', capital).strip()

               indProv.append({
                   'id': td[0].get_text(strip=True),
                   'nama': td[1].get_text(strip=True),
                   'ibukota': cleaned_capital,
                   'luas_wilayah': cleaned_areas,
                   'ipm': td[7].get_text(strip=True)
               })
   else:
       print("No tbody found")
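   The same row-extraction pattern can be exercised offline on a minimal table. The HTML below is an illustrative stand-in for the Wikipedia page (the values are not the real scraped data, and only three columns are shown):

   ```python
   from bs4 import BeautifulSoup
   import re

   # tiny stand-in for the Wikipedia table (illustrative, not real data)
   html = """
   <table><tbody>
     <tr><th>No</th><th>Nama</th><th>Ibu kota</th></tr>
     <tr><td>1</td><td>Aceh</td><td>Banda Aceh[a]</td></tr>
   </tbody></table>
   """

   tbody = BeautifulSoup(html, 'html.parser').find('tbody')
   rows = []
   for tr in tbody.find_all('tr'):
       td = tr.find_all('td')
       if td:  # header rows contain only <th>, so they are skipped
           # strip footnote markers like [a], same as the cleaning step above
           capital = re.sub(r'\[.*?\]', '', td[2].get_text(strip=True)).strip()
           rows.append({'id': td[0].get_text(strip=True),
                        'nama': td[1].get_text(strip=True),
                        'ibukota': capital})
   print(rows)
   ```

   Rows made of `<th>` cells yield an empty `find_all('td')`, which is why the header is filtered out without any extra bookkeeping.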

5. Make sure that the length of the array matches the number of provinces in Indonesia, which is 38.

   indProv, len(indProv)
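   The check above can also be a hard assertion, so the notebook fails fast if the scrape is incomplete. The list here is a stand-in just to make the snippet self-contained:

   ```python
   # stand-in for the scraped indProv list (illustrative only)
   indProv = [{'id': str(i)} for i in range(1, 39)]

   expected = 38  # number of Indonesian provinces
   assert len(indProv) == expected, f"expected {expected} rows, got {len(indProv)}"
   print("row count OK")
   ```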

6. Some information is missing or inconsistent in the data, so we need to normalize it.

   • Fill in the missing capital for Jakarta:

   # the ibukota value for Jakarta is missing
   indProv[10]['ibukota'] = 'Jakarta'
   print(indProv[10])

   • Replace 'Palangka Raya' with 'palangkaraya', because the weather API only recognizes 'palangkaraya':

   # rename the capital to the spelling the weather API expects
   for prov in indProv:
       if prov['ibukota'] == 'Palangka Raya':
           prov['ibukota'] = 'palangkaraya'

   • Remove the entry that is not found in the weather API data:

   indProv.remove(indProv[36])
   len(indProv)

7. Save it to a NumPy file.

   import os
   import numpy as np

   os.makedirs('temp', exist_ok=True)  # make sure the target directory exists

   indProv_array = np.array(indProv)
   np.save('temp/ind_prov.npy', indProv_array)
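   Because the entries are dicts, NumPy stores this as an object array, and loading it back later requires allow_pickle=True. A self-contained sketch (the sample data and temp path are illustrative, not the real scraped values):

   ```python
   import os
   import tempfile
   import numpy as np

   # illustrative sample shaped like indProv (not the real scraped values)
   sample = [{'id': '11', 'nama': 'DKI Jakarta', 'ibukota': 'Jakarta'}]

   path = os.path.join(tempfile.mkdtemp(), 'ind_prov.npy')
   np.save(path, np.array(sample))

   # dict entries make this an object array, so np.load needs allow_pickle=True
   loaded = np.load(path, allow_pickle=True).tolist()
   print(loaded == sample)
   ```

   Without allow_pickle=True, np.load raises a ValueError on object arrays, which is the first thing to check when the next part of the series reads this file back.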